Abstract

In this paper, we consider the problem of hyper-sparse aggregation. Namely, given a dictionary $F = \{f_1, \ldots, f_M\}$ of functions, we look for an optimal aggregation algorithm that writes $\hat f = \sum_{j=1}^M \theta_j f_j$ with as many zero coefficients $\theta_j$ as possible. This problem is of particular interest when $F$ contains many irrelevant functions that should not appear in $\hat f$. We provide an exact oracle inequality for $\hat f$, where only two coefficients are non-zero, that entails $\hat f$ to be an optimal aggregation algorithm. Since selectors are suboptimal aggregation procedures, this proves that 2 is the minimal number of elements of $F$ required for the construction of an optimal aggregation procedure in every situation. A simulated example of this algorithm is proposed on a dictionary obtained using LARS, for the problem of selection of the regularization parameter of the LASSO. We also give an example of use of aggregation to achieve minimax adaptation over anisotropic Besov spaces, which was not previously known in minimax theory (in regression on a random design).

Keywords. Aggregation; Exact oracle inequality; Empirical risk minimization; Empirical process theory; Sparsity; Minimax adaptation



1 Introduction 

1.1 Motivations 

In this paper, we consider the problem of sparse aggregation. Namely, given a dictionary $F = \{f_1, \ldots, f_M\}$ of functions, we look for an optimal aggregation algorithm that writes $\hat f = \sum_{j=1}^M \theta_j f_j$ with as many zero coefficients $\theta_j$ as possible. This question appears when one wants to use aggregation procedures to construct adaptive procedures. Indeed, in practice many elements of the dictionary appear to be irrelevant. We would like to remove these irrelevant elements completely from the final aggregate, while keeping the optimality of the procedure ("optimality" is used in reference to the definition of "optimal aggregation procedure" provided in Tsybakov (2003b) and Lecué and Mendelson (2009a)). Moreover, one could imagine large dictionaries containing many different types of estimators (kernel estimators, projection estimators, etc.) with many different parameters (smoothing parameters, groups of variables, etc.). Some of the estimators are likely to be better adapted than the others, depending on the kind of model that fits the data well. So, we would like to construct procedures that can adapt to different models by combining only those estimators of the dictionary that are best adapted to the model at hand.

Up to now, optimal procedures are based on exponential weights (cf. Juditsky et al. (2008), Dalalyan and Tsybakov (2007)), providing aggregation procedures with no zero coefficients, even for the worst elements in the dictionary. An improvement going in the direction of sparse aggregation has been made using a preselection step in Lecué and Mendelson (2009a). This preselection step removes all the estimators in $F$ which perform badly on a learning subsample.

In the present work, we prove that optimal aggregation algorithms with only two non-zero coefficients exist, see Section 2, Theorem 1. This means that the aggregate writes as a convex combination of only two elements of $F$. Then, we propose an original proof, involving an explicit geometrical setup, of the already known fact that selecting a single element of $F$ using empirical risk minimization is a suboptimal aggregation procedure, see Theorem 2. Finally, we use our "hyper-sparse" aggregate on a dictionary consisting of penalized empirical risk minimizers (PERM). The aim is to construct, as an application of the previous analysis, an adaptive estimator over anisotropic Besov balls, namely an estimator that adapts to the unknown anisotropic smoothness of the regression function, in the sense that it achieves the optimal minimax rate without a priori knowledge of the anisotropic smoothness parameters. This result was not, as far as we know, previously proposed in minimax theory. To do so, we use recent results by Mendelson and Neeman (2009) on regularized learning together with our oracle inequality for the hyper-sparse aggregate.



1.2 The model 

Let $\Omega$ be a measurable space endowed with a probability measure $\mu$ and $\nu$ be a probability measure on $\Omega \times \mathbb{R}$ such that $\mu$ is its marginal on $\Omega$. Assume $(X, Y)$ and $D_n := (X_i, Y_i)_{i=1}^n$ to be $n + 1$ independent random variables distributed according to $\nu$. We work under the following assumption.

Assumption 1. We can write

$$Y = f_0(X) + \varepsilon, \qquad (1)$$

where $\varepsilon$ is such that $E(\varepsilon | X) = 0$ and $E(\varepsilon^2 | X) \le \sigma_\varepsilon^2$ a.s. for some constant $\sigma_\varepsilon > 0$.

We will assume further that either $Y$ is bounded: $\|Y\|_\infty < +\infty$, or that $\varepsilon$ is subgaussian: $\|\varepsilon\|_{\psi_2} := \inf\{c > 0 : E[\exp((\varepsilon/c)^2)] \le 2\} < +\infty$, see below. We want to estimate the regression function $f_0$ using the observations $D_n$. If $f$ is a function, its error of prediction is given by the risk

$$R(f) = E(f(X) - Y)^2,$$

and if $\hat f$ is a random function depending on the data $D_n$, the error of prediction is the conditional expectation

$$R(\hat f) = E[(\hat f(X) - Y)^2 | D_n].$$
Given a set of functions $F$, a natural way to approximate $f_0$ is to consider the empirical risk minimizer (ERM), that minimizes the functional

$$R_n(f) := \frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2$$

over $F$. This very basic principle is at the core of the procedures proposed in this paper. We will also commonly use the following notations. If $f_F \in \arg\min_{f \in F} R(f)$, we will consider the excess loss

$$\mathcal{L}_f = \mathcal{L}_F(f)(X, Y) := (Y - f(X))^2 - (Y - f_F(X))^2,$$

and use the notations

$$P\mathcal{L}_f := E\mathcal{L}_f(X, Y), \qquad P_n\mathcal{L}_f := \frac{1}{n} \sum_{i=1}^n \mathcal{L}_f(X_i, Y_i).$$
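The quantities defined in this section can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: functions are plain Python callables of a real variable, and the names `empirical_risk`, `erm` and `empirical_excess_loss` are hypothetical helpers, not part of the paper. Since $f_F$ is unknown in practice, the last helper takes a reference function as an argument.

```python
from typing import Callable, List, Sequence, Tuple

Function = Callable[[float], float]


def empirical_risk(f: Function, xs: Sequence[float], ys: Sequence[float]) -> float:
    """R_n(f) = (1/n) * sum_i (Y_i - f(X_i))^2."""
    return sum((y - f(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def erm(dictionary: List[Function], xs: Sequence[float],
        ys: Sequence[float]) -> Tuple[int, float]:
    """Index and empirical risk of an empirical risk minimizer over a finite class."""
    risks = [empirical_risk(f, xs, ys) for f in dictionary]
    j = min(range(len(risks)), key=risks.__getitem__)
    return j, risks[j]


def empirical_excess_loss(f: Function, f_ref: Function,
                          xs: Sequence[float], ys: Sequence[float]) -> float:
    """P_n L_f, with f_ref standing in for the (unknown) risk minimizer f_F."""
    return empirical_risk(f, xs, ys) - empirical_risk(f_ref, xs, ys)
```

For a finite dictionary the ERM is a single pass over the $M$ empirical risks, which is why it serves as the basic building block of the procedures studied below.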



2 Hyper-sparse aggregation 



2.1 The aggregation problem 

Assume that we are given a finite set $F = \{f_1, \ldots, f_M\}$ of functions (usually called a dictionary). The aggregation problem is to construct procedures $\hat f$ (usually called aggregates) satisfying inequalities of the form

$$R(\hat f) \le c \min_{f \in F} R(f) + r(F, n), \qquad (2)$$

where the result holds with high probability or in expectation. Inequalities of the form (2) are called oracle inequalities and $r(F, n)$ is called the residue. We want the residue to be as small as possible. A classical result (cf. Juditsky et al. (2008)) says that aggregates with values in $F$ cannot mimic exactly (that is for $c = 1$) the oracle faster than $r(F, n) \sim ((\log M)/n)^{1/2}$. Nevertheless, it is possible to mimic the oracle up to the residue $(\log M)/n$ (see Juditsky et al. (2008) and Lecué and Mendelson (2009a), among others).



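An aggregate typically writes as a convex combination of the elements of $F$, namely

$$\hat f = \sum_{j=1}^M \theta_j f_j,$$

where $\theta := (\theta_j(D_n, F))_{j=1}^M$ is a vector of weights, depending on the data and on the dictionary, that belongs to $\Theta$, where

$$\Theta := \Big\{\lambda \in (\mathbb{R}_+)^M : \sum_{j=1}^M \lambda_j = 1\Big\}.$$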



Popular examples of aggregation algorithms are the aggregate with cumulated exponential weights (ACEW), see Catoni (2001); Leung and Barron (2006); Juditsky et al. (2008, 2005); Audibert (2009), where the weights are given by

$$\theta_j^{(ACEW)} := \frac{1}{n} \sum_{k=1}^n \frac{\exp\big(-\sum_{i=1}^k (Y_i - f_j(X_i))^2 / T\big)}{\sum_{l=1}^M \exp\big(-\sum_{i=1}^k (Y_i - f_l(X_i))^2 / T\big)},$$

where $T$ is the so-called temperature parameter, and the aggregate with exponential weights (AEW), see Dalalyan and Tsybakov (2007) among others, where

$$\theta_j^{(AEW)} := \frac{\exp\big(-\sum_{i=1}^n (Y_i - f_j(X_i))^2 / T\big)}{\sum_{l=1}^M \exp\big(-\sum_{i=1}^n (Y_i - f_l(X_i))^2 / T\big)}.$$

The ACEW satisfies (2) with $c = 1$ and $r(F, n) \sim (\log M)/n$, see references above, hence it is optimal in the sense of Tsybakov (2003b). In these aggregates, no coefficient equals zero, although they can be very small, depending on the value of $R_n(f_j)$ and $T$ (this makes in particular the choice of $T$ important). In this paper, we look for an aggregation algorithm that shares the same property of optimality, but with as few non-zero coefficients $\theta_j$ as possible, hence the name hyper-sparse aggregate. We ask the following question:

Question 1. What is the minimal number of non-zero coefficients $\theta_j$ such that an aggregation procedure $\hat f = \sum_{j=1}^M \theta_j f_j$ is optimal?
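To make the contrast with sparse aggregates concrete, here is a minimal sketch of the two exponential-weight schemes recalled above. It assumes the squared losses $(Y_i - f_j(X_i))^2$ have already been computed and stored in a table `losses[j][i]`; the function names and this data layout are our own illustrative choices, and the ACEW is implemented as the average over $k$ of the weights built on the first $k$ observations.

```python
import math
from typing import List, Sequence


def exp_weights(cum_losses: Sequence[float], temperature: float) -> List[float]:
    """Weights proportional to exp(-loss / T); the minimum is subtracted for stability."""
    m = min(cum_losses)
    w = [math.exp(-(c - m) / temperature) for c in cum_losses]
    total = sum(w)
    return [v / total for v in w]


def aew_weights(losses: List[List[float]], temperature: float) -> List[float]:
    """AEW: exponential weights built on the full-sample sums of squared losses."""
    return exp_weights([sum(row) for row in losses], temperature)


def acew_weights(losses: List[List[float]], temperature: float) -> List[float]:
    """ACEW: average over k = 1..n of the exponential weights built on the first k losses."""
    M, n = len(losses), len(losses[0])
    cum = [0.0] * M
    avg = [0.0] * M
    for k in range(n):
        for j in range(M):
            cum[j] += losses[j][k]
        w = exp_weights(cum, temperature)
        for j in range(M):
            avg[j] += w[j] / n
    return avg
```

Every weight returned by these functions is strictly positive (up to floating-point underflow), which is exactly the lack of sparsity that Question 1 addresses.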

It turns out that the answer to this question is two. Indeed, if every coefficient is zero except one, the aggregate coincides with an element of $F$, and we know that such a procedure can only achieve the rate $((\log M)/n)^{1/2}$ (see Juditsky et al. (2008) and Theorem 2 below where, in the particular case of the ERM, the suboptimality of this kind of procedure can be understood from a geometrical point of view; this differs from the statistical point of view of Juditsky et al. (2008), which involves a "min-max" type theorem). In Definition 1 we construct three procedures, two of which (see (7) and (8)) have only two non-zero coefficients $\theta_j$, and we prove in Theorem 1 below that these procedures are optimal. We shall assume one of the following.

Assumption 2. One of the following holds.

• There is a constant $b > 0$ such that:

$$\max\big(\|Y\|_\infty, \sup_{f \in F} \|f\|_\infty\big) \le b. \qquad (3)$$

• There is a constant $b > 0$ such that:

$$\max\Big(\|\varepsilon\|_{\psi_2}, \Big\|\sup_{f \in F} |f(X) - f_0(X)|\Big\|_{\psi_2}\Big) \le b. \qquad (4)$$

Note that Assumption (4) allows an unbounded dictionary $F$. The results given below differ a bit depending on the considered assumption (there is an extra $\log n$ term in the subgaussian case given by (4)). To simplify the notations, we assume from now on that we have $2n$ observations from a sample $D_{2n} = (X_i, Y_i)_{i=1}^{2n}$. Let us define our aggregation procedures.

Definition 1 (Aggregation procedures). Proceed as follows:
(0. Initialization) Choose a confidence level $x > 0$. If (3) holds, define

$$\phi = \phi_{n,M}(x) = b \sqrt{\frac{\log M + x}{n}}.$$
If (4) holds, define

$$\phi = \phi_{n,M}(x) = (\sigma_\varepsilon + b) \sqrt{\frac{(\log M + x) \log n}{n}}.$$

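(1. Splitting) Split the sample $D_{2n}$ into $D_{n,1} = (X_i, Y_i)_{i=1}^n$ and $D_{n,2} = (X_i, Y_i)_{i=n+1}^{2n}$.
(2. Preselection) Use $D_{n,1}$ to define a random subset of $F$:

$$F_1 = \big\{f \in F : R_{n,1}(f) \le R_{n,1}(\hat f_{n,1}) + c \max\big(\phi \|\hat f_{n,1} - f\|_{n,1}, \phi^2\big)\big\}, \qquad (5)$$

where $\|f\|_{n,1}^2 = n^{-1} \sum_{i=1}^n f(X_i)^2$, $R_{n,1}(f) = n^{-1} \sum_{i=1}^n (f(X_i) - Y_i)^2$ and $\hat f_{n,1} \in \arg\min_{f \in F} R_{n,1}(f)$.

(3. Aggregation) Choose $\mathcal{F}$ as one of the following sets:

$$\mathcal{F} = \mathrm{conv}(F_1) = \text{the convex hull of } F_1, \qquad (6)$$
$$\mathcal{F} = \mathrm{seg}(F_1) = \text{the segments between the functions in } F_1, \qquad (7)$$
$$\mathcal{F} = \mathrm{star}(\hat f_{n,1}, F_1) = \text{the segments between } \hat f_{n,1} \text{ and the elements of } F_1, \qquad (8)$$

and return the ERM relative to $D_{n,2}$:

$$\hat f \in \arg\min_{g \in \mathcal{F}} R_{n,2}(g),$$

where $R_{n,2}(f) = n^{-1} \sum_{i=n+1}^{2n} (f(X_i) - Y_i)^2$.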
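As an illustration of Steps 0 and 2, the sketch below computes the preselected index set from prediction vectors. It is a simplified reading of the definition under assumptions of our own: each dictionary element is represented by its values `preds[j][i]` $= f_j(X_i)$ on the first half of the sample, only the bounded-case choice of $\phi$ is implemented, and the names `phi_bounded` and `preselect` are hypothetical.

```python
import math
from typing import List, Sequence


def phi_bounded(b: float, M: int, n: int, x: float) -> float:
    """Bounded-case threshold factor phi = b * sqrt((log M + x) / n) of Step 0."""
    return b * math.sqrt((math.log(M) + x) / n)


def preselect(preds: List[Sequence[float]], ys: Sequence[float],
              phi: float, c: float) -> List[int]:
    """Indices j kept by the preselection rule (5), computed on the first subsample."""
    n = len(ys)
    risks = [sum((p - y) ** 2 for p, y in zip(row, ys)) / n for row in preds]
    j_hat = min(range(len(preds)), key=risks.__getitem__)  # ERM on D_{n,1}
    kept = []
    for j, row in enumerate(preds):
        # Empirical L2 distance between f_j and the ERM on the first subsample.
        dist = math.sqrt(sum((u - v) ** 2 for u, v in zip(preds[j_hat], row)) / n)
        if risks[j] <= risks[j_hat] + c * max(phi * dist, phi ** 2):
            kept.append(j)
    return kept
```

The threshold grows with the distance to the ERM, so a function far from $\hat f_{n,1}$ is kept only if its empirical risk is not much larger; the ERM itself is always kept.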

These algorithms are illustrated in Figures 1 and 2. In Figure 1 we summarize the aggregation steps in the three cases. In Figure 2 we give a simulated illustration of the preselection step, and we show the value of the weights of the AEW for comparison. As mentioned above, Step 3 of the algorithm returns, when $\mathcal{F}$ is given by (7) or (8), a function which is a convex combination of only two functions in $F$, among the ones remaining after the preselection step. The preselection step was introduced in Lecué and Mendelson (2009a), with the use of (6) in the aggregation step.

Each of the three procedures proposed in Definition 1 is optimal in view of Theorem 1 below. From the computational point of view, procedure (8) is the most appealing: an ERM in $\mathrm{star}(\hat f_{n,1}, F_1)$ can be computed in a fast and explicit way, see Algorithm 1 below.

Theorem 1. Let $x > 0$ be a confidence level, $F$ be a dictionary with cardinality $M$ and $\hat f$ be one of the aggregation procedures given in Definition 1. If

$$\max\big(\|Y\|_\infty, \sup_{f \in F} \|f\|_\infty\big) \le b,$$

we have, with $\nu^{2n}$-probability at least $1 - 2e^{-x}$:

$$R(\hat f) \le \min_{f \in F} R(f) + c_b \frac{(1 + x) \log M}{n},$$

where $c_b$ is a constant depending on $b$, and where we recall that $R(\hat f) = E[(Y - \hat f(X))^2 | (X_i, Y_i)_{i=1}^{2n}]$. If

$$\max\Big(\|\varepsilon\|_{\psi_2}, \Big\|\sup_{f \in F} |f(X) - f_0(X)|\Big\|_{\psi_2}\Big) \le b,$$

we have, with $\nu^{2n}$-probability at least $1 - 4e^{-x}$:

$$R(\hat f) \le \min_{f \in F} R(f) + c_{\sigma_\varepsilon, b} \frac{(1 + x) \log M \log n}{n}.$$

Remark 1. Note that the definition of the set $F_1$, and thus $\hat f$, depends on the confidence $x$ through the factor $\phi_{n,M}(x)$.

Remark 2. To simplify the proofs, we don't give the explicit values of the constants. However, when (3) holds, one can choose $c = 4(1 + 9b)$ in (5), and $c = c_1(1 + b)$ when (4) holds (where $c_1$ is the absolute constant appearing in Theorem 5). Of course, this is not likely to be the optimal choice.

2.2 The star-shaped aggregate 

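In this section we give details for the computation of the star-shaped aggregate, namely the aggregate $\hat f$ given by Definition 1 when $\mathcal{F}$ is (8). Indeed, if $\lambda \in [0, 1]$, we have

$$R_{n,2}(\lambda f + (1 - \lambda) g) = \lambda R_{n,2}(f) + (1 - \lambda) R_{n,2}(g) - \lambda (1 - \lambda) \|f - g\|_{n,2}^2,$$

where $\|f\|_{n,2}^2 = n^{-1} \sum_{i=n+1}^{2n} f(X_i)^2$, so the minimum of $\lambda \mapsto R_{n,2}(\lambda f + (1 - \lambda) g)$ over $[0, 1]$ is achieved at

$$\lambda_{n,2}(f, g) = 0 \vee \frac{1}{2} \Big(\frac{R_{n,2}(g) - R_{n,2}(f)}{\|f - g\|_{n,2}^2} + 1\Big) \wedge 1,$$

where $a \vee b = \max(a, b)$, $a \wedge b = \min(a, b)$, and $\min_{\lambda \in [0,1]} R_{n,2}(\lambda f + (1 - \lambda) g)$ is thus equal to $R_{n,2}(\lambda_{n,2}(f, g) f + (1 - \lambda_{n,2}(f, g)) g)$, given by

$$\begin{cases}
R_{n,2}(f) & \text{if } R_{n,2}(g) - R_{n,2}(f) \ge \|f - g\|_{n,2}^2, \\
\dfrac{R_{n,2}(f) + R_{n,2}(g)}{2} - \dfrac{(R_{n,2}(f) - R_{n,2}(g))^2}{4 \|f - g\|_{n,2}^2} - \dfrac{\|f - g\|_{n,2}^2}{4} & \text{if } |R_{n,2}(f) - R_{n,2}(g)| < \|f - g\|_{n,2}^2, \\
R_{n,2}(g) & \text{otherwise.}
\end{cases}$$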
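The closed form above is easy to check numerically. The sketch below takes the two empirical risks and the squared empirical distance as plain numbers; `lambda_opt` and `segment_risk` are illustrative names of our own, and the degenerate case of a zero distance (where the two functions coincide on the sample) is handled by an arbitrary convention.

```python
def segment_risk(lam: float, risk_f: float, risk_g: float, sq_dist: float) -> float:
    """Empirical risk of lam * f + (1 - lam) * g, from the identity for the squared loss."""
    return lam * risk_f + (1.0 - lam) * risk_g - lam * (1.0 - lam) * sq_dist


def lambda_opt(risk_f: float, risk_g: float, sq_dist: float) -> float:
    """Minimizer over [0, 1] of lam -> segment_risk(lam, risk_f, risk_g, sq_dist)."""
    if sq_dist == 0.0:
        # Convention for the degenerate case: any lam gives the same risk.
        return 1.0 if risk_f <= risk_g else 0.0
    raw = 0.5 * ((risk_g - risk_f) / sq_dist + 1.0)
    return min(1.0, max(0.0, raw))
```

Since the map is a convex parabola in $\lambda$, clipping the unconstrained minimizer to $[0, 1]$ gives the constrained one, so no line search is needed.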
Figure 2: Empirical risk $R_{n,1}(f)$, value of the threshold $R_{n,1}(\hat f_{n,1}) + 2 \max(\phi \|\hat f_{n,1} - f\|_{n,1}, \phi^2)$ and weights of the AEW (rescaled for illustration purposes) for $f \in F$, where $F$ is a dictionary obtained using LARS, see Section 4 below. Only the elements of $F$ with an empirical risk smaller than the threshold are kept from the dictionary, see Definition 1. The first and third examples correspond to a case where an aggregate with preselection step improves upon AEW, while in the second example, both procedures behave similarly.

This leads to the following algorithm for the computation of $\hat f$.

Algorithm 1: Computation of the star-shaped aggregate.

Input: dictionary $F$, data $(X_i, Y_i)_{i=1}^{2n}$, and a confidence level $x > 0$
Output: star-shaped aggregate $\hat f$

Split $D_{2n}$ into two samples $D_{n,1}$ and $D_{n,2}$
foreach $j \in \{1, \ldots, M\}$ do
    Compute $R_{n,1}(f_j)$ and $R_{n,2}(f_j)$, and use this loop to find
    $\hat f_{n,1} \in \arg\min_{f \in F} R_{n,1}(f)$
end
foreach $j \in \{1, \ldots, M\}$ do
    Compute $\|f_j - \hat f_{n,1}\|_{n,1}$ and $\|f_j - \hat f_{n,1}\|_{n,2}$
end
Construct the set of preselected elements
    $F_1 = \{f \in F : R_{n,1}(f) \le R_{n,1}(\hat f_{n,1}) + c \max(\phi \|\hat f_{n,1} - f\|_{n,1}, \phi^2)\}$,
where $\phi$ is given in Definition 1
foreach $f \in F_1$ do
    compute
    $R_{n,2}(\lambda_{n,2}(\hat f_{n,1}, f) \hat f_{n,1} + (1 - \lambda_{n,2}(\hat f_{n,1}, f)) f)$
    and keep the element $f_{\hat\jmath} \in F_1$ that minimizes this quantity
end
return
    $\hat f = \lambda_{n,2}(\hat f_{n,1}, f_{\hat\jmath}) \hat f_{n,1} + (1 - \lambda_{n,2}(\hat f_{n,1}, f_{\hat\jmath})) f_{\hat\jmath}$
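A compact Python rendering of Algorithm 1 may help to see how the pieces fit together. It is a sketch under assumptions of our own, not the authors' implementation: the dictionary is given through its prediction table `preds[j][i]` $= f_j(X_i)$ on the full sample, the split is deterministic (first half, second half), and $\phi$ and $c$ are passed as plain numbers. The function `star_aggregate` and its return convention are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def risk(pred: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical quadratic risk of a prediction vector."""
    return sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys)


def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared empirical L2 distance between two prediction vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def star_aggregate(preds: List[Sequence[float]], ys: Sequence[float],
                   phi: float, c: float = 2.0) -> Tuple[int, int, float]:
    """Return (j_hat, j_star, lam); the aggregate is lam*f_{j_hat} + (1-lam)*f_{j_star}."""
    n = len(ys) // 2
    y1, y2 = ys[:n], ys[n:2 * n]
    p1 = [row[:n] for row in preds]
    p2 = [row[n:2 * n] for row in preds]
    r1 = [risk(row, y1) for row in p1]
    r2 = [risk(row, y2) for row in p2]
    j_hat = min(range(len(preds)), key=r1.__getitem__)  # ERM on the first half
    # Preselection on the first half, rule (5).
    kept = [j for j in range(len(preds))
            if r1[j] <= r1[j_hat]
            + c * max(phi * math.sqrt(sq_dist(p1[j_hat], p1[j])), phi ** 2)]
    # ERM over the segments [f_{j_hat}, f_j], j in kept, on the second half.
    best = (j_hat, 1.0, r2[j_hat])
    for j in kept:
        d = sq_dist(p2[j_hat], p2[j])
        if d == 0.0:
            continue  # f_j coincides with f_{j_hat} on the second half
        lam = min(1.0, max(0.0, 0.5 * ((r2[j] - r2[j_hat]) / d + 1.0)))
        val = lam * r2[j_hat] + (1.0 - lam) * r2[j] - lam * (1.0 - lam) * d
        if val < best[2]:
            best = (j, lam, val)
    return j_hat, best[0], best[1]
```

The cost is linear in $M$ once the prediction table is available, and the returned aggregate involves at most two dictionary elements.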
2.3 Suboptimality of Penalized ERM

In this section, we prove that minimizing the empirical risk (or a penalized version, called PERM from now on) on $F(A)$ is a suboptimal aggregation procedure both in expectation and deviation. According to Tsybakov (2003b), the optimal rate of aggregation in the gaussian regression model is $(\log M)/n$. This means that it is the minimum price one has to pay in order to mimic the best function among a class of $M$ functions with $n$ observations. This rate is achieved by the aggregate with cumulated exponential weights (see Catoni (2001), Yang (2000) and Juditsky et al.). We prove that the PERM cannot achieve this rate and thus that it is suboptimal compared to the aggregation methods with exponential weights. The lower bounds for aggregation methods appearing in the literature (see Tsybakov (2003b); Juditsky et al. (2008); Lecué (2006)) are usually based on minimax theory arguments. In particular, in Tsybakov (2003b), it is proved that a selector (that is an aggregation procedure taking its values in the dictionary itself) cannot mimic the oracle faster than $\sqrt{(\log M)/n}$. This result implies the one that we have here, but it doesn't provide an explicit setup for which a given selector performs poorly. The result in Juditsky et al. (2008) says that whatever the selector is, there exists a probability measure and a dictionary for which it cannot mimic the oracle faster than $\sqrt{(\log M)/n}$. The proof of this result does not tell explicitly which probabilistic setup is bad for this selector. In the present result, we are interested in a particular type of selector: the PERM for some penalty. We can provide an explicit framework (dictionary and probabilistic setup) because the argument considered here is based on some geometric considerations (in the same spirit as the lower bound obtained in Lee et al. (1996) and Mendelson (2008)). The explicit example that makes the PERM fail is the following Gaussian regression model with uniform design:

Assumption 3 (G). Assume that $\varepsilon$ is standard Gaussian and that $X$ is univariate and uniformly distributed on $[0, 1]$.

The dictionary is constructed as follows.

Figure 3: Example of a setup in which ERM performs badly. The set $F(A) = \{f_1, \ldots, f_M\}$ is the dictionary from which we want to mimic the best element and $f_0$ is the regression function.

For the regression function we take

$$f_0(x) = \begin{cases} [\ldots] & \text{if } x^{(M)} = 1, \\ [\ldots] & \text{if } x^{(M)} = 0, \end{cases}$$

where $x$ has the dyadic decomposition $x = \sum_{k \ge 1} x^{(k)} 2^{-k}$ where $x^{(k)} \in \{0, 1\}$ and

$$[\ldots] = \frac{[\ldots]}{4} \sqrt{\frac{[\ldots]}{n}}.$$

We consider the dictionary of functions $F_M = \{f_1, \ldots, f_M\}$

$$f_j(x) = 2 x^{(j)} - 1, \quad \forall j \in \{1, \ldots, M\}, \qquad (10)$$

where again $(x^{(j)} : j \ge 1)$ is the dyadic decomposition of $x \in [0, 1]$.
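The dictionary (10) is simple to evaluate. The sketch below extracts dyadic digits by floating-point arithmetic, which is only reliable for moderate $j$ (double precision carries 53 binary digits); the helper names are our own. Under the uniform design of Assumption (G), the digits $x^{(j)}$ are independent Bernoulli(1/2) variables, so the $f_j$ behave like independent signs.

```python
def dyadic_digit(x: float, k: int) -> int:
    """k-th digit x^(k) of the expansion x = sum_{k>=1} x^(k) 2^{-k}, for x in [0, 1)."""
    return int(x * 2 ** k) % 2


def f_j(x: float, j: int) -> int:
    """Dictionary element f_j(x) = 2 * x^(j) - 1, with values in {-1, +1}."""
    return 2 * dyadic_digit(x, j) - 1
```

For instance, $0.625 = 2^{-1} + 2^{-3}$ has digits $(1, 0, 1, 0, \ldots)$.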

Theorem 2. There exists an absolute constant $c_0 > 0$ such that the following holds. Let $M \ge 2$ be an integer and assume that (G) holds. We can find a regression function $f_0$ and a family $F(A)$ of cardinality $M$ such that, if one considers a penalization satisfying $|\mathrm{pen}(f)| \le C \sqrt{(\log M)/n}$, $\forall f \in F(A)$, with $0 \le C < \sigma (24 \sqrt{2} c^*)^{-1}$ ($c^*$ is an absolute constant from the Sudakov minorization, see Appendix A.2), the PERM procedure defined by

$$\hat f_n \in \arg\min_{f \in F(A)} \big(R_n(f) + \mathrm{pen}(f)\big)$$

satisfies, with probability greater than $c_0$,

$$\|\hat f_n - f_0\|^2 \ge \min_{f \in F(A)} \|f - f_0\|^2 + C_3 \sqrt{\frac{\log M}{n}}$$

for any integer $n \ge 1$ and $M \ge M_0(\sigma)$ such that $n^{-1} \log[(M - 1)(M - 2)] \le 1/4$, where $C_3$ is an absolute constant.

This result tells that, in some particular cases, the PERM cannot mimic the best element in a class of cardinality $M$ faster than $((\log M)/n)^{1/2}$. This rate is very far from the optimal one $(\log M)/n$. Of course, one can say that the PERM fails to achieve the optimal rate only in the very particular framework that we have constructed here. Nevertheless, this approach can be generalized (we refer the reader to Lecué and Mendelson (2009b) for instance). Finally, remark that classical penalty functions are of the order [Complexity of the class] divided by $n$, which is in our aggregation setup of the order of $(\log M)/n$. Thus, the restriction that we have on the penalty function covers the classical cases that one can meet in the literature on penalization methods.

Let $F(A)$ be the set that we consider in the proof of Theorem 2 (see Section 5 below), and take $\mathrm{pen}(f) = 0$. Using Monte-Carlo (we do 5000 loops), we compute the excess risk $E\|\hat f_n - f_0\|^2 - \min_{f \in F(A)} \|f - f_0\|^2$ of the ERM. In Figure 4 below, we compare the excess risk and the bound $((\log M)/n)^{1/2}$ for several values of $M$ and $n$. It turns out that, for this set $F(A)$, the lower bound $((\log M)/n)^{1/2}$ is indeed accurate for the excess risk. Actually, by using the classical symmetrization argument and Dudley's entropy integral (or Pisier's inequality), it is easy to obtain an upper bound for the excess risk of the ERM of the order of $((\log M)/n)^{1/2}$ for any class $F(A)$ of cardinality $M$.

As an application of the aggregation Algorithm 1, we consider the problem of adaptation to the regularization parameter of a penalized empirical risk minimization procedure, denoted for short PERM in what follows.
Figure 4: The excess risk of the ERM compared to $((\log M)/n)^{1/2}$ for several values of $M$ and $n$ ($x$-axis)



3 An example of dictionary: Penalized ERM 
3.1 Definition and tools 

Let us fix a function space $\mathcal{F}$, endowed with a seminorm $|\cdot|_{\mathcal{F}}$. The set $\mathcal{F}$ is a space of functions, such as a Sobolev, Besov or Reproducing Kernel Hilbert Space (RKHS), the latter being a common example in regularized learning, see Cucker and Smale (2002). A simple example (in the one dimensional case) is the Sobolev space $W_2^s$ of functions such that $|f|_{\mathcal{F}}^2 = \int f^{(s)}(t)^2 \, dt < +\infty$, which corresponds to the so-called smoothing splines estimator, see Wahba (1990). A PERM (which stands for penalized empirical risk minimization) minimizes the functional

$$\frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 + \mathrm{pen}(f) \qquad (11)$$

over $\mathcal{F}$, where $\mathrm{pen}(f)$ is a quantity measuring the smoothness (or "roughness") of $f \in \mathcal{F}$. Typically, the penalization term writes $\mathrm{pen}(f) = h^2 |f|_{\mathcal{F}}^2$ (see van de Geer (2000) and Györfi et al. (2002) among others), where $h > 0$ is a regularization parameter.



In Mendelson and Neeman (2009), sharp error bounds for the PERM are established, in the general context of a so-called ordered and parametrized hierarchy $\{\mathcal{F}_r : r \ge 0\}$. An example of such an ordered and parametrized hierarchy is

$$\mathcal{F}_r = r \mathcal{F}_1, \quad \text{where } \mathcal{F}_1 = \{f \in \mathcal{F} : |f|_{\mathcal{F}} \le 1\}.$$

In the latter paper, a very sharp analysis is conducted when $\mathcal{F}$ is a RKHS, allowing for penalizations less than quadratic in the RKHS norm. In this section, we use the tools proposed in Mendelson and Neeman (2009) to derive an error bound for the PERM using the standard penalty $\mathrm{pen}(f) = h^2 |f|_{\mathcal{F}}^2$, but when $\mathcal{F}$ is a Besov space. In the nonparametric estimation literature, Besov spaces are of particular interest since they include functions with inhomogeneous smoothness, for instance functions with rapid oscillations or bumps. Moreover, since the design random variable $X$ is possibly multivariate, the question of anisotropic smoothness naturally arises. Anisotropy means that the smoothness of the regression function $f_0$ differs in each direction. As far as we know, adaptive estimation of a multivariate curve with anisotropic smoothness was previously considered only in Gaussian white noise or density models, see Kerkyacharian et al. (2001), Hoffmann and Lepski (2002), Kerkyacharian et al. (2007), Neumann (2000). There is no result concerning the adaptive estimation of the regression with anisotropic smoothness on a general random design $X$. In order to simplify the definition of the anisotropic Besov space, we shall assume from now on that $\Omega = \mathbb{R}^d$. Let us consider the following compactness assumption on the unit ball $\mathcal{F}_1$. It uses metric entropy, which is a standard measure of compactness in learning theory, see Cucker and Smale (2002) for instance. Recall that $\|f\|_\infty = \sup_{x \in \mathbb{R}^d} |f(x)|$, and denote by $C(\mathbb{R}^d)$ the set of continuous functions on $\mathbb{R}^d$, endowed with the $L^\infty$-norm. If $\mathcal{F}_1 \subset C(\mathbb{R}^d)$, we introduce $H_\infty(\mathcal{F}_1, \delta) = \log N_\infty(\mathcal{F}_1, \delta)$, where $N_\infty(\mathcal{F}_1, \delta)$ is the minimal number of $L^\infty$-balls with radius $\delta$ needed to cover $\mathcal{F}_1$.

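Assumption 4 ($C_\beta$). Assume that $\mathcal{F}$ embeds continuously in $C(\mathbb{R}^d)$, and that there is a number $\beta \in (0, 2)$ such that for any $\delta > 0$, the unit ball of $\mathcal{F}$ satisfies:

$$H_\infty(\mathcal{F}_1, \delta) \le c \, \delta^{-\beta}, \qquad (12)$$

where $c > 0$ is independent of $\delta$.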

This assumption entails $H_\infty(\mathcal{F}_r, \delta) \le c (r/\delta)^\beta$ for any $r > 0$. Moreover, the continuous embedding gives that $\|f\|_\infty \le c |f|_{\mathcal{F}}$ for any $f \in \mathcal{F}$. Assumption 4 is satisfied by nearly all the smoothness spaces considered in the nonparametric literature (at least when the smoothness of the space is large enough compared to the dimension, see below). Let us give an example. Let $B^{\mathbf{s}}_{p,q}$ be the anisotropic Besov space with smoothness $\mathbf{s} = (s_1, \ldots, s_d)$. This space is precisely defined in Appendix B. Each $s_i$ corresponds to the smoothness in the $i$-th coordinate. The computation of the entropy of $\mathcal{F}_1$ when $\mathcal{F} = B^{\mathbf{s}}_{p,q}$ is done in Theorem 5.30 from Triebel (2006). Namely, if $\bar s$ is the harmonic mean of $\mathbf{s}$, given by

$$\frac{1}{\bar s} := \frac{1}{d} \sum_{i=1}^d \frac{1}{s_i}, \qquad (13)$$

then the unit ball $\mathcal{F}_1$ of $\mathcal{F} = B^{\mathbf{s}}_{p,q}$ satisfies Assumption 4 with $\beta = d/\bar s$, given that $\bar s > d/p$, which is the usual condition to have the embedding in $C(\mathbb{R}^d)$.
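As a small numerical companion to (13), the following sketch computes the harmonic mean of a smoothness vector, the resulting entropy exponent $\beta = d/\bar s$, and the embedding condition $\bar s > d/p$. The function names are our own and the smoothness vector is passed as a plain list.

```python
from typing import Sequence


def harmonic_mean(s: Sequence[float]) -> float:
    """s_bar defined by 1 / s_bar = (1/d) * sum_i 1 / s_i."""
    return len(s) / sum(1.0 / si for si in s)


def entropy_exponent(s: Sequence[float]) -> float:
    """beta = d / s_bar, which equals sum_i 1 / s_i."""
    return len(s) / harmonic_mean(s)


def embedding_condition(s: Sequence[float], p: float) -> bool:
    """True when s_bar > d / p, the condition for the embedding in C(R^d)."""
    return harmonic_mean(s) > len(s) / p
```

Note that the requirement $\beta \in (0, 2)$ of Assumption 4 amounts to $\bar s > d/2$, so a single very rough direction can break the assumption even when the other directions are smooth.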



3.2 Local complexity using entropy 



Now, we have in mind to use Theorem 2.5 from Mendelson and Neeman (2009), in order to derive a risk bound for the PERM. For this, we need a control on the local complexity of $\mathcal{F}_r$, for any $r > 0$. The complexity is measured in this paper by the expectation $E\|P - P_n\|_{V_{r,\lambda}}$, where for $r, \lambda > 0$, $V_{r,\lambda}$ is the class of excess losses

$$V_{r,\lambda} := \{\alpha \mathcal{L}_{r,f} : 0 \le \alpha \le 1, \; f \in \mathcal{F}_r, \; E(\alpha \mathcal{L}_{r,f}) \le \lambda\},$$

where

$$\mathcal{L}_{r,f} := (Y - f(X))^2 - (Y - f_r^*(X))^2$$

and $f_r^* \in \arg\min_{f \in \mathcal{F}_r} E(Y - f(X))^2$. The next Lemma (a proof is given in Appendix A.3) gives a bound on this measure of the complexity under Assumption 4.

Lemma 1. Assume that $\|Y\|_\infty < +\infty$ and grant Assumption 4. One has, for any $r, \lambda > 0$:

$$E\|P - P_n\|_{V_{r,\lambda}} \le c \max\Big(r^2 n^{-1/(1+\beta/2)}, \; \frac{r^{1+\beta/2} \lambda^{(1-\beta/2)/2}}{\sqrt{n}}\Big),$$

where $c = c_{\beta, \|Y\|_\infty}$.

This Lemma, although probably not optimal, is sufficient to provide a satisfactory risk bound for the PERM with a penalization of the form $\mathrm{pen}(f) = h^2 |f|_{\mathcal{F}}^2$. It is close in spirit to a bound proposed in Loustau (2009) (see Theorem 1) for the classification framework using a Besov penalization, with an extra assumption on the inputs $X_i$, since the proof involves a decomposition on a wavelet basis. Here we use only the entropy condition, together with some basic tools from empirical process theory, see the proof in Appendix A.3.



3.3 A risk bound for the PERM using entropy 

Now, we can derive a risk bound for the PERM using Lemma 1 and the results from Mendelson and Neeman (2009). First, note that

$$\lambda/8 \ge c \max\Big(r^2 n^{-1/(1+\beta/2)}, \; \frac{r^{1+\beta/2} \lambda^{(1-\beta/2)/2}}{\sqrt{n}}\Big)$$

if and only if $\lambda \ge c r^2 n^{-1/(1+\beta/2)}$ (for a possibly different constant $c$). So, any $\lambda \ge c r^2 n^{-1/(1+\beta/2)}$ satisfies, using Lemma 1, that $\lambda/8 \ge E\|P - P_n\|_{V_{r,\lambda}}$, and consequently, using the "isomorphic coordinate projection" (see Theorem 2.2 in Mendelson and Neeman (2009)), we have that for any $f \in \mathcal{F}_r$, the following holds with probability larger than $1 - 2e^{-x}$:



$$\frac{1}{2} P_n \mathcal{L}_{r,f} - \rho_n(r, x) \le P \mathcal{L}_{r,f} \le 2 P_n \mathcal{L}_{r,f} + \rho_n(r, x),$$

where

$$\rho_n(r, x) := c \Big(r^2 n^{-1/(1+\beta/2)} + \frac{(1 + r^2) x}{n}\Big). \qquad (14)$$

This explains the shape of the usual quadratic penalization $\mathrm{pen}(f) = h^2 |f|_{\mathcal{F}}^2$, where $h = c \, n^{-1/(2+\beta)}$ (up to the other term, which is of smaller order $1/n$), and this entails the following.

Theorem 3. Assume that $\|Y\|_\infty \le b$, and grant Assumption 4. Let $\rho_n(r, x)$ be given by (14) and define for $r, y > 0$:

$$\theta(r, y) = y + \log(\pi^2/6) + 2 \log(1 + cn + \log r),$$

where $c = c_{\beta, b}$. Then, for any $x > 0$, with probability at least $1 - 2\exp(-x)$, any $\hat f \in \mathcal{F}$ that minimizes the functional

$$P_n \ell_f + c_1 \rho_n\big(2 |f|_{\mathcal{F}}, \theta(|f|_{\mathcal{F}}, x)\big)$$

over $\mathcal{F}$ also satisfies

$$P \ell_{\hat f} \le \inf_{f \in \mathcal{F}} \Big[P \ell_f + c_2 \rho_n\big(2 |f|_{\mathcal{F}}, \theta(|f|_{\mathcal{F}}, x)\big)\Big].$$

Proof. The conditions of Theorem 2.5 in Mendelson and Neeman (2009) are satisfied with $\rho_n(r, x)$ given by (14). The statement of the Theorem easily follows from it, using the same arguments as in the proof of Theorem 3.7 therein. □
Let us rewrite the result of Theorem 3. For any $x > 0$, if $\hat f$ is the PERM at level $x$, one has, with $\nu^n$-probability larger than $1 - 2e^{-x}$:

$$P \ell_{\hat f} \le \inf_{r > 0} \Big[P \ell_{f_r} + c_1 r^2 n^{-2/(2+\beta)} + \frac{c_2 (1 + r^2)}{n} \Big(x + \log\frac{\pi^2}{6} + \log(1 + c_3 n + \log r)\Big)\Big],$$

where we recall that $P \ell_{\hat f} = E[(Y - \hat f(X))^2 | X_1, \ldots, X_n]$ and $f_r \in \arg\min_{f \in \mathcal{F}_r} R(f)$. This inequality proves that $\hat f$ adapts to the radius $|f_0|_{\mathcal{F}}$ of $f_0$ in $\mathcal{F}$. The leading term in the right hand side of this inequality is $r^2 n^{-2/(2+\beta)}$. If $\mathcal{F} = B^{\mathbf{s}}_{p,\infty}$, it becomes $r^2 n^{-2\bar s/(2\bar s + d)}$, which is the minimax optimal rate of convergence over anisotropic Besov space, see Kerkyacharian et al. (2007) for instance.
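The passage from the entropy exponent to the Besov rate is the identity $2/(2 + \beta) = 2\bar s/(2\bar s + d)$ when $\beta = d/\bar s$. The short sketch below checks it numerically; both function names are illustrative choices of our own.

```python
from typing import Sequence


def rate_exponent_from_beta(beta: float) -> float:
    """Exponent a such that the leading term is r^2 * n^(-a), with a = 2 / (2 + beta)."""
    return 2.0 / (2.0 + beta)


def rate_exponent_from_smoothness(s: Sequence[float]) -> float:
    """Same exponent written as 2 * s_bar / (2 * s_bar + d), s_bar the harmonic mean."""
    d = len(s)
    s_bar = d / sum(1.0 / si for si in s)
    return 2.0 * s_bar / (2.0 * s_bar + d)
```

For $\mathbf{s} = (1, 3)$ one gets $\bar s = 1.5$, $\beta = 4/3$, and both expressions give the exponent $0.6$.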



3.4 Adaptive estimation over anisotropic Besov space 

What we have in mind now is the application of Theorems 1 and 3 to the problem of adaptive estimation over a collection of anisotropic Besov spaces. Consider two vectors $\mathbf{s}^{\min}$ and $\mathbf{s}^{\max}$ in $\mathbb{R}^d$ with positive coordinates and harmonic means $\bar s^{\min}$ and $\bar s^{\max}$ respectively, satisfying $\mathbf{s}^{\min} \le \mathbf{s}^{\max}$ (that is, $s_i^{\min} \le s_i^{\max}$ for any $i \in \{1, \ldots, d\}$) and $\bar s^{\min} > d/\min(p, 2)$. Consider the collection of anisotropic Besov spaces

$$(B^{\mathbf{s}}_{p,\infty} : \mathbf{s} \in S), \quad \text{where } S := \prod_{i=1}^d [s_i^{\min}, s_i^{\max}]. \qquad (15)$$

The strategy is to aggregate a dictionary of PERM, corresponding to a discretization of $S$, in order to adapt to the anisotropic smoothness of $f_0$. The steps are the following. We shall assume to simplify that we have $2n$ observations.

Definition 2 (Adaptive estimator).

1. Split (at random) the whole sample $(X_i, Y_i)_{i=1}^{2n}$ into a training sample $(X_i, Y_i)_{i=1}^n$ and a learning sample $(X_i, Y_i)_{i=n+1}^{2n}$. Fix a confidence level $x > 0$.

2. Compute the uniform discretization of $S$ with step $(\log n)^{-1}$:

$$S_n := \prod_{i=1}^d \big\{s_i^{\min} + k (\log n)^{-1} : 1 \le k \le [(s_i^{\max} - s_i^{\min}) \log n]\big\}. \qquad (16)$$

Then, for each $\mathbf{s} \in S_n$, take $\hat f_{\mathbf{s}}$ as a minimizer of the functional

$$\frac{1}{n} \sum_{i=1}^n (Y_i - f(X_i))^2 + \mathrm{pen}_{\mathbf{s}}(f, x),$$

where

$$\mathrm{pen}_{\mathbf{s}}(f, x) = c_1 n^{-2\bar s/(2\bar s + d)} |f|^2_{B^{\mathbf{s}}_{p,\infty}} + \frac{c_2 \big(1 + |f|^2_{B^{\mathbf{s}}_{p,\infty}}\big)}{n} \Big(x + \log\frac{\pi^2}{6} + \log\big(1 + c_3 n + \log |f|_{B^{\mathbf{s}}_{p,\infty}}\big)\Big).$$

If $b$ is such that $\|Y\|_\infty \le b$, consider the dictionary of truncated PERM

$$F^{PERM} := \{-b \vee \hat f_{\mathbf{s}} \wedge b : \mathbf{s} \in S_n\}.$$

3. Using the learning sample $(X_i, Y_i)_{i=n+1}^{2n}$, compute one of the aggregates $\hat f$ given in Definition 1 using the dictionary $F^{PERM}$.
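The discretization (16) and the truncation used to build the dictionary can be sketched as follows. This only illustrates the grid and the clipping, not the computation of the PERM itself; the brackets in (16) are read as the integer part, and `smoothness_grid` and `truncate` are hypothetical helper names.

```python
import itertools
import math
from typing import List, Sequence, Tuple


def smoothness_grid(s_min: Sequence[float], s_max: Sequence[float],
                    n: int) -> List[Tuple[float, ...]]:
    """Product grid S_n with step 1 / log(n) in each coordinate, as in (16)."""
    log_n = math.log(n)
    axes = []
    for lo, hi in zip(s_min, s_max):
        k_max = math.floor((hi - lo) * log_n)
        axes.append([lo + k / log_n for k in range(1, k_max + 1)])
    return list(itertools.product(*axes))


def truncate(value: float, b: float) -> float:
    """Clip a prediction to [-b, b], i.e. (-b) v value ^ b."""
    return max(-b, min(b, value))
```

The grid has of the order of $(\log n)^d$ points, so the dictionary cardinality $M$ enters the residue of Theorem 1 only through a $\log\log n$ factor.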

The next Theorem, which is an immediate consequence of Theorems 1 and 3, proves that the aggregate $\hat f$ is minimax adaptive over the collection of anisotropic Besov spaces (15).



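Theorem 4. Let $\hat f$ be the aggregated estimator given in Definition 2. Assume that $\max(\|Y\|_\infty, \|f_0\|_\infty) \le b$ and that $f_0 \in B^{\mathbf{s}_0}_{p,\infty}$ for some $\mathbf{s}_0 \in S$, where $(B^{\mathbf{s}}_{p,\infty} : \mathbf{s} \in S)$ is the collection given by (15) that satisfies $\bar s^{\min} > d/p$. Then, with $\nu^{2n}$-probability larger than $1 - 4e^{-x}$, we have:

$$\|\hat f - f_0\|^2_{L^2(\mu)} \le c_1 r_0^2 \, n^{-\frac{2\bar s_0}{2\bar s_0 + d}} + c_2 [\ldots] \Big(x + \log(\pi^2/6) + c \log(1 + c_3 n + \log r_0)\Big),$$

where $r_0 = |f_0|_{B^{\mathbf{s}_0}_{p,\infty}}$ and [...]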



The dominating term in the right hand side is of order $n^{-2\bar s_0/(2\bar s_0 + d)}$, which is the minimax optimal rate of convergence over anisotropic Besov space (a minimax lower bound over $B^{\mathbf{s}}_{p,\infty}$ can be easily obtained using standard arguments, such as the ones from Tsybakov (2003a), together with Bernstein estimates over $B^{\mathbf{s}}_{p,\infty}$ that can be found in Triebel (2006) for instance). Note that there is no regular or sparse zone here, since the error of estimation is measured with the $L^2(\mu)$ norm. The result obtained here is stronger than the ones usually obtained in minimax theory, where one only gives an upper bound for $E\|\hat f - f_0\|^2_{L^2(\mu)}$, while here a concentration inequality is given for $\|\hat f - f_0\|^2_{L^2(\mu)}$.



4 Simulation study 

In this section, we propose a simulation study for the problem of selection of the regularization parameter of the LASSO, see Tibshirani (1996); Efron et al. (2004). We simulate i.i.d. data

$$Y_i = \beta_0^\top X_i + \varepsilon_i,$$

where $\beta_0$ is a vector of size $p = 91$ given by

$$\beta_0 = (3, 1.5, 0_{30}, 2, -6, 4, 0_{25}, -4, 0_{15}, 2.5, 3, 0_{10}, 3, 1, -2),$$

where $0_n$ is the vector in $\mathbb{R}^n$ with each coordinate set to zero. The noise $\varepsilon_i$ is centered Gaussian with variance $\sigma^2$. The vector $X = (X^1, \ldots, X^p)$ is a centered Gaussian vector such that the correlation between $X^i$ and $X^j$ is $2^{-|i-j|}$ (following the examples from Tibshirani (1996)). Using the lars routine from R (www.r-project.org), we construct a dictionary $F$ made of the entire sequence of LASSO type estimators for various regularization parameters coming out of the LARS algorithm. Then we compare the prediction error $|X(\hat\beta - \beta_0)|_2$ and the estimation error $|\hat\beta - \beta_0|_2$ where $\hat\beta$ is:
• $\hat\beta^{(C_p)}$ = the LASSO with regularization parameter selected using the Mallows-$C_p$ selection rule, see Efron et al. (2004)

• $\hat\beta^{(AEW)}$ = the aggregate with exponential weights computed on $F$ with temperature parameter $4\sigma^2$, see for instance Dalalyan and Tsybakov (2007)

• $\hat\beta^{(star)}$ = the star-shaped aggregate, see Algorithm 1 with constant $c = 2$.

We compute the errors $|X(\hat\beta - \beta_0)|_2$ and $|\hat\beta - \beta_0|_2$ using 100 simulations for several values of $n$ and $\sigma^2$. The splits are chosen at random, with size $n/2$ for training and $n/2$ for learning, for both the AEW and the star-shaped aggregate (we don't split the learning sample). For both aggregates we do some jackknife: instead of using a single aggregate, we compute a mean of 10 aggregates obtained with several splits chosen at random. This makes the final aggregates less dependent on the split. In order to make the oracle and the Mallows-$C_p$ errors comparable to the error of the aggregates (which need to split the data, while Mallows-$C_p$ doesn't), we compute the weights of aggregation using splitting, then we compute the aggregate using a dictionary $F$ computed using the whole sample.
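The simulated design described above can be sketched as follows. This is only an illustration under stated assumptions, not the code used for the experiments (which relies on the lars routine in R): the correlation $2^{-|i-j|}$ is produced here by an AR(1) recursion, and the function names `build_beta0`, `sample_x`, `simulate` and `random_split` are hypothetical.

```python
import math
import random


def build_beta0():
    """beta_0 of size p = 91: the listed non-zero blocks separated by runs of zeros."""
    return ([3, 1.5] + [0.0] * 30 + [2, -6, 4] + [0.0] * 25 + [-4]
            + [0.0] * 15 + [2.5, 3] + [0.0] * 10 + [3, 1, -2])


def sample_x(p, rng, rho=0.5):
    """Centered Gaussian vector with unit variances and corr(X^i, X^j) = rho^|i-j|.

    Assumption: an AR(1) recursion is used, which yields exactly this
    correlation structure (rho = 1/2 gives 2^{-|i-j|}).
    """
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)
    for _ in range(p - 1):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x


def simulate(n, sigma, seed=0):
    """Draw n i.i.d. pairs (X_i, Y_i) with Y_i = <beta_0, X_i> + Gaussian noise."""
    rng = random.Random(seed)
    beta0 = build_beta0()
    xs, ys = [], []
    for _ in range(n):
        x = sample_x(len(beta0), rng)
        noise = rng.gauss(0.0, sigma)
        xs.append(x)
        ys.append(sum(b * v for b, v in zip(beta0, x)) + noise)
    return xs, ys, beta0


def random_split(n, rng):
    """Random split of {0, ..., n-1} into two halves (training / learning)."""
    idx = list(range(n))
    rng.shuffle(idx)
    return idx[: n // 2], idx[n // 2:]
```

Averaging the aggregates obtained over 10 calls to `random_split` would mimic the jackknife step described above.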

The conclusion is that, for this example, the star-shaped aggregate does a better job than both the AEW and the $C_p$ in most cases. When the noise level is not too high ($\sigma = 2$, which corresponds to a RSNR of 5), see the errors given in Figure 5, the star-shaped aggregate is always the best. When the noise level is high ($\sigma = 5$, RSNR = 2) and $n$ is small, see Figure 6, the story is different: the AEW is better than the star-shaped aggregate. In such an extreme situation the AEW takes advantage of the averaging (recall that no coefficient is zero in the AEW). However, when $n$ becomes larger than $p$, the star-shaped aggregate improves again upon AEW.

[Figure 5: Errors $|\hat\beta - \beta_0|_2$ (first row) and $|X(\hat\beta - \beta_0)|_2$ (second row) for $\sigma = 2$ and $n = 70, 100, 150$. Each panel compares Oracle, Star-shaped, AEW and Cp Mallows.]

[Figure 6: Errors $|\hat\beta - \beta_0|_2$ (first row) and $|X(\hat\beta - \beta_0)|_2$ (second row) for $\sigma = 5$ and $n = 70, 100, 150$. Each panel compares Oracle, Star-shaped, AEW and Cp Mallows.]

5 Proofs of the main results
5.1 Proof of Theorem 1

Proof of Theorem 1. Let us prove the result in the $\psi_2$ case, the other case is similar. Fix $x > 0$ and let $\hat F$ be either (6), (7) or (8). Set $d := \mathrm{diam}(\hat F_1, L^2(\mu))$. Consider the second half of the sample $D_{n,2} = (X_i, Y_i)_{i=n+1}^{2n}$. By Corollary 1 (see Appendix A below), with probability at least $1 - 4\exp(-x)$ (relative to $D_{n,2}$), we have for every $f \in \hat F$:

$$\Big|\frac{1}{n}\sum_{i=n+1}^{2n} \mathcal{L}_{\hat F}(f)(X_i, Y_i) - E\big(\mathcal{L}_{\hat F}(f)(X, Y)\,\big|\,D_{n,1}\big)\Big| \le c(\sigma_\varepsilon + b)\max(d\phi, b\phi^2),$$

where $\mathcal{L}_{\hat F}(f)(X, Y) := (f(X) - Y)^2 - (f_{\hat F}(X) - Y)^2$ is the excess loss function relative to $\hat F$, $f_{\hat F} \in \mathrm{argmin}_{f \in \hat F} R(f)$ and where $\phi = \sqrt{((\log M + x)\log n)/n}$. By definition of $\hat f$, we have $\frac{1}{n}\sum_{i=n+1}^{2n} \mathcal{L}_{\hat F}(\hat f)(X_i, Y_i) \le 0$, so, on this event (relative to $D_{n,2}$),

$$R(\hat f) \le R(f_{\hat F}) + E\big(\mathcal{L}_{\hat F}(\hat f)\,\big|\,D_{n,1}\big) - \frac{1}{n}\sum_{i=n+1}^{2n} \mathcal{L}_{\hat F}(\hat f)(X_i, Y_i) \quad (17)$$
$$\le R(f_{\hat F}) + c(\sigma_\varepsilon + b)\max(d\phi, b\phi^2)$$
$$= R(f_F) + \big(c(\sigma_\varepsilon + b)\max(d\phi, b\phi^2) - (R(f_F) - R(f_{\hat F}))\big)$$
$$=: R(f_F) + \beta,$$

and it remains to show that

$$\beta \le c_{b,\sigma_\varepsilon}\frac{(1 + x)\log M \log n}{n}.$$

When $\hat F$ is given by (6) or (7), the geometrical configuration is the same as in Lecué and Mendelson (2009a), so we skip the proof. Let us turn to the situation where $\hat F$ is given by (8). Recall that $\hat f_{n,1}$ is the ERM on $\hat F_1$ using $D_{n,1}$. Consider $f_1$ such that $\|\hat f_{n,1} - f_1\|_{L^2(\mu)} = \max_{f \in \hat F_1}\|\hat f_{n,1} - f\|_{L^2(\mu)}$, and note that $\|\hat f_{n,1} - f_1\|_{L^2(\mu)} \le d \le 2\|\hat f_{n,1} - f_1\|_{L^2(\mu)}$. The mid-point $f_2 := (\hat f_{n,1} + f_1)/2$ belongs to $\mathrm{star}(\hat f_{n,1}, \hat F_1)$. Using the parallelogram identity, we have for any $u, v \in L^2(\nu)$:

$$E_\nu\Big(\frac{u + v}{2}\Big)^2 \le \frac{E_\nu(u^2) + E_\nu(v^2)}{2} - \frac{\|u - v\|^2_{L^2(\nu)}}{4},$$

where for every $h \in L^2(\nu)$, $E_\nu(h) = E h(X, Y)$. In particular, for $u(X, Y) = \hat f_{n,1}(X) - Y$ and $v(X, Y) = f_1(X) - Y$, the mid-point is $(u(X, Y) + v(X, Y))/2 = f_2(X) - Y$. Hence,

$$R(f_2) = E(f_2(X) - Y)^2 = E\Big(\frac{\hat f_{n,1}(X) + f_1(X)}{2} - Y\Big)^2$$
$$\le \frac{1}{2}E(\hat f_{n,1}(X) - Y)^2 + \frac{1}{2}E(f_1(X) - Y)^2 - \frac{1}{4}\|\hat f_{n,1} - f_1\|^2_{L^2(\mu)}$$
$$\le \frac{1}{2}R(\hat f_{n,1}) + \frac{1}{2}R(f_1) - \frac{d^2}{16},$$

where the expectations are taken conditioned on $D_{n,1}$. By Lemma 4 (see Appendix A below), since $\hat f_{n,1}, f_1 \in \hat F_1$, we have

$$\frac{1}{2}R(\hat f_{n,1}) + \frac{1}{2}R(f_1) \le R(f_F) + c(\sigma_\varepsilon + b)\max(\phi d, b\phi^2),$$

and thus, since $f_2 \in \hat F$,

$$R(f_{\hat F}) \le R(f_2) \le R(f_F) + c(\sigma_\varepsilon + b)\max(\phi d, b\phi^2) - c d^2.$$

Therefore,

$$\beta = c(\sigma_\varepsilon + b)\max(d\phi, b\phi^2) - (R(f_F) - R(f_{\hat F})) \le c(\sigma_\varepsilon + b)\max(\phi d, b\phi^2) - c d^2.$$

Finally, if $d \ge c_{\sigma_\varepsilon, b}\,\phi$ then $\beta \le 0$, otherwise $\beta \le c_{\sigma_\varepsilon, b}\,\phi^2$. □
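The mid-point step used in the proof above rests on the parallelogram identity, which gives the risk of the mid-point a gain proportional to the squared distance between the two functions. A minimal numerical sketch, with the expectation replaced by an empirical mean over arbitrary simulated values (the samples `u` and `v` are purely illustrative):

```python
import random


def mean_sq(w):
    """Empirical second moment, standing in for E(w^2)."""
    return sum(t * t for t in w) / len(w)


rng = random.Random(0)
u = [rng.gauss(0.0, 1.0) for _ in range(1000)]  # plays the role of f_{n,1}(X) - Y
v = [rng.gauss(1.0, 2.0) for _ in range(1000)]  # plays the role of f_1(X) - Y

mid = [(a + b) / 2.0 for a, b in zip(u, v)]
diff = [a - b for a, b in zip(u, v)]

lhs = mean_sq(mid)
rhs = (mean_sq(u) + mean_sq(v)) / 2.0 - mean_sq(diff) / 4.0
```

The two sides agree up to rounding, so the risk of the mid-point is the average of the two risks minus a quarter of the squared distance.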
Proof of Theorem 2. The dictionary $F_M$ is chosen so that we have, for any $j \in \{1, \ldots, M-1\}$,

$$\|f_j - f_0\|^2_{L^2([0,1])} = \frac{5h^2}{2} + 1 \quad\text{and}\quad \|f_M - f_0\|^2_{L^2([0,1])} = \frac{5h^2}{2} - h + 1.$$

Thus, we have

$$\min_{j=1,\ldots,M}\|f_j - f_0\|^2_{L^2([0,1])} = \|f_M - f_0\|^2_{L^2([0,1])} = \frac{5h^2}{2} - h + 1.$$

This geometrical setup for F(A), which is an unfavourable setup for the ERM, is represented in Figure 3. For

$$\tilde f_n \in \mathrm{argmin}_{f \in F_M}\,(R_n(f) + \mathrm{pen}(f)),$$

where we take $R_n(f) = \frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2 = \|Y - f\|_n^2$, we have

$$E\|\tilde f_n - f_0\|^2_{L^2([0,1])} = \min_{j=1,\ldots,M}\|f_j - f_0\|^2_{L^2([0,1])} + h\,P[\tilde f_n \ne f_M]. \quad (18)$$

Now, we upper bound $P[\tilde f_n = f_M]$. We consider the dyadic decomposition of the design variable $X$:

$$X = \sum_{k=1}^{+\infty} X^{(k)} 2^{-k}, \quad (19)$$

where $(X^{(k)} : k \ge 1)$ is a sequence of i.i.d. random variables following a Bernoulli $\mathcal{B}(1/2, 1)$ with parameter 1/2 (because $X$ is uniformly distributed on $[0, 1]$). If we define



1 

vn i=i 



we have by the definition of h and since € {—1,1} 

^(r-/^-r-/ 3 -ii*) 



= ^ - ^ + ^ E(cPci M) + 3(cp } - d M) ) - 1) 

v £= 1 

4C , 

>Nj-N M ylog M . 

a 



This entails, for $\bar N_{M-1} := \max_{1 \le j \le M-1} N_j$, that



M-l 



/m] = P[ f) {\\Y - f M \\l -\\Y- fiWl < pen(fj) -pen(/ M )} 



< f\n m > N M -i - — VlogM 

L (7 



It is easy to check that $N_1, \ldots, N_M$ are $M$ normalized standard Gaussian random variables, uncorrelated (but dependent). We denote by $\zeta$ the family of Rademacher variables $(\zeta_i^{(j)} : i = 1, \ldots, n;\ j = 1, \ldots, M)$. We have, for any $6C/\sigma < \gamma < (2\sqrt{2}c^*)^{-1}$ ($c^*$ is the "Sudakov constant", see Theorem 7),

$$P[\tilde f_n = f_M] \le E\Big[P\Big(N_M \ge \bar N_{M-1} - \frac{6C}{\sigma}\sqrt{\log M}\,\Big|\,\zeta\Big)\Big]$$
$$\le P\Big[N_M \ge -\gamma\sqrt{\log M} + E(\bar N_{M-1}|\zeta)\Big] + E\Big[P\Big\{E(\bar N_{M-1}|\zeta) - \bar N_{M-1} \ge \Big(\gamma - \frac{6C}{\sigma}\Big)\sqrt{\log M}\,\Big|\,\zeta\Big\}\Big]. \quad (20)$$

Conditionally to $\zeta$, the vector $(N_1, \ldots, N_{M-1})$ is a linear transform of the Gaussian vector $(\varepsilon_1, \ldots, \varepsilon_n)$. Hence, conditionally to $\zeta$, $(N_1, \ldots, N_{M-1})$ is a Gaussian vector. Thus, we can use a standard deviation result for the supremum of Gaussian random vectors (see for instance Massart (2007), Chapter 3.2.4), which leads to the following inequality for the second term of the RHS in (20):

$$P\Big\{E(\bar N_{M-1}|\zeta) - \bar N_{M-1} \ge \Big(\gamma - \frac{6C}{\sigma}\Big)\sqrt{\log M}\,\Big|\,\zeta\Big\} \le \exp\big(-(3C/\sigma - \gamma/2)^2\log M\big).$$

Remark that we used $E[N_j^2|\zeta] = 1$ for any $j = 1, \ldots, M-1$. For the first term in the RHS of (20), we have

$$P\Big[N_M \ge -\gamma\sqrt{\log M} + E(\bar N_{M-1}|\zeta)\Big] \le P\Big[N_M \ge -2\gamma\sqrt{\log M} + E(\bar N_{M-1})\Big] + P\Big[-\gamma\sqrt{\log M} + E(\bar N_{M-1}) \ge E(\bar N_{M-1}|\zeta)\Big]. \quad (21)$$



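Next, we use Sudakov's Theorem (cf. Theorem 7 in Appendix A.2) to lower bound $E(\bar N_{M-1})$. Since $(N_1, \ldots, N_{M-1})$ is, conditionally to $\zeta$, a Gaussian vector and since for any $1 \le j \ne k \le M$ we have

$$E[(N_k - N_j)^2|\zeta] = \frac{1}{n}\sum_{i=1}^n \big(\zeta_i^{(k)} - \zeta_i^{(j)}\big)^2,$$

then, according to Sudakov's minoration (cf. Theorem 7 in the Appendix), there exists an absolute constant $c^* > 0$ such that

$$c^* E[\bar N_{M-1}|\zeta] \ge \min_{1 \le j \ne k \le M-1}\Big(\frac{1}{n}\sum_{i=1}^n \big(\zeta_i^{(k)} - \zeta_i^{(j)}\big)^2\Big)^{1/2}\sqrt{\log M}.$$

Thus, we have

$$c^* E[\bar N_{M-1}] \ge E\Big[\min_{j \ne k}\Big(\frac{1}{n}\sum_{i=1}^n \big(\zeta_i^{(k)} - \zeta_i^{(j)}\big)^2\Big)^{1/2}\Big]\sqrt{\log M} \ge \sqrt{2}\Big(1 - E\Big[\max_{j \ne k}\frac{1}{n}\sum_{i=1}^n \zeta_i^{(k)}\zeta_i^{(j)}\Big]\Big)\sqrt{\log M},$$

where we used the fact that $\sqrt{x} \ge x/\sqrt{2}$, $\forall x \in [0, 2]$. Besides, using Hoeffding's inequality we have $E[\exp(s\xi^{(j,k)})] \le \exp(s^2/(2n))$ for any $s > 0$, where $\xi^{(j,k)} := n^{-1}\sum_{i=1}^n \zeta_i^{(k)}\zeta_i^{(j)}$. Then, using a maximal inequality (cf. Theorem 8 in Appendix A.2) and since $n^{-1}\log[(M-1)(M-2)] \le 1/4$, we have

$$E\Big[\max_{j \ne k}\frac{1}{n}\sum_{i=1}^n \zeta_i^{(k)}\zeta_i^{(j)}\Big] \le \Big(\frac{1}{n}\log[(M-1)(M-2)]\Big)^{1/2} \le \frac{1}{2}. \quad (22)$$

This entails, together with the previous lower bound, that $E[\bar N_{M-1}] \ge (c^*\sqrt{2})^{-1}\sqrt{\log M}$.

Thus, using this inequality in the first term of the RHS of (21) and the usual inequality on the tail of a Gaussian random variable ($N_M$ is standard Gaussian), we obtain:

$$P\Big[N_M \ge -2\gamma\sqrt{\log M} + E(\bar N_{M-1})\Big] \le P\Big[N_M \ge \big((c^*\sqrt{2})^{-1} - 2\gamma\big)\sqrt{\log M}\Big] \le \exp\Big(-\big((c^*\sqrt{2})^{-1} - 2\gamma\big)^2(\log M)/2\Big). \quad (23)$$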



Remark that we used $2\sqrt{2}c^*\gamma < 1$. For the second term in (21), we apply the concentration inequality of Theorem 6 to the non-negative random variable $E[\bar N_{M-1}|\zeta]$. We first have to control the second moment of this variable. We know that, conditionally to $\zeta$, $N_j|\zeta \sim \mathcal{N}(0, 1)$, thus $N_j|\zeta \in L_{\psi_2}$ (for more details on Orlicz norms, we refer the reader to van der Vaart and Wellner (1996)). Thus,

$$\Big\|\max_{1 \le j \le M-1} N_j|\zeta\Big\|_{\psi_2} \le K\psi_2^{-1}(M)\max_{1 \le j \le M-1}\|N_j|\zeta\|_{\psi_2}$$

(cf. Lemma 2.2.2 in van der Vaart and Wellner (1996)). Since $\|N_j|\zeta\|_{\psi_2} = 1$, we have $\|\max_{1 \le j \le M-1} N_j|\zeta\|_{\psi_2} \le K\sqrt{\log M}$. In particular, we have $E[\max_{1 \le j \le M-1} N_j^2|\zeta] \le K\log M$ and so $E(E[\bar N_{M-1}|\zeta])^2 \le K\log M$. Then, Theorem 6 provides

$$P\Big[-\gamma\sqrt{\log M} + E[\bar N_{M-1}] \ge E[\bar N_{M-1}|\zeta]\Big] \le \exp(-\gamma^2/c_0), \quad (24)$$

where $c_0$ is an absolute constant.

Finally, combining (20), (23), (21) and (24) in the initial inequality (20), we obtain

$$P[\tilde f_n = f_M] \le \exp\big(-(3C/\sigma - \gamma/2)^2\log M\big) + \exp\Big(-\big((c^*\sqrt{2})^{-1} - 2\gamma\big)^2(\log M)/2\Big) + \exp(-\gamma^2/c_0).$$

Take $\gamma = (12\sqrt{2}c^*)^{-1}$. It is easy to find an integer $M_0(\sigma)$ depending only on $\sigma$ such that for any $M \ge M_0$, we have $P[\tilde f_n = f_M] \le c_1 < 1$, where $c_1$ is an absolute constant. We complete the proof by using this last result in (18). □
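The Hoeffding-type bound $E[\exp(s\xi^{(j,k)})] \le \exp(s^2/(2n))$ used in the proof above can be checked exactly. The sketch below assumes, as in the proof, that for $j \ne k$ the products $\zeta_i^{(k)}\zeta_i^{(j)}$ are i.i.d. Rademacher variables, so that the moment generating function of their mean is $\cosh(s/n)^n$. The function names are hypothetical.

```python
import itertools
import math


def mgf_rademacher_mean(s, n):
    """Closed form of E exp(s * xi) for xi = mean of n i.i.d. Rademacher signs."""
    return math.cosh(s / n) ** n


def mgf_by_enumeration(s, n):
    """Same quantity by brute force over the 2^n equally likely sign vectors."""
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        total += math.exp(s * sum(signs) / n)
    return total / 2 ** n


def hoeffding_bound(s, n):
    """Upper bound exp(s^2 / (2n)) on the moment generating function."""
    return math.exp(s * s / (2.0 * n))


checks = [(mgf_rademacher_mean(s, n), hoeffding_bound(s, n))
          for n in (1, 10, 100) for s in (0.1, 1.0, 5.0, 20.0)]
```

The comparison reduces to the elementary inequality $\cosh(t) \le \exp(t^2/2)$ applied with $t = s/n$.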

Proof of Theorem^ Recall that we use the sample D n \ to compute the family F = 
{—b V f s A b : s E S n } of PERM, which has cardinality c(logn)f, and the sample 
D n ,2 to compute the weights of the aggregate /, see Definitional Recall also that 
there is Sq = (so,ij • • • j So,d) € S such that /o € B*^, and denote tq = l/ols'o^- 
Take = • • • , £ S n such that s* j < Sqj < s*j + (logn) -1 for all 

j = 1, . . . ,d. Remark that for this choice, one has Sf ^ C Bp*^ and n - 2s */( 2s *+ rf ) < 
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e <V 2 n -2s /(2s +(2) _ gy a substraction of Pi fo on both sides of the oracle inequality 
stated in Theorem [TJ and since ||/o||oo < b, we can find an event A xrh2 satisfying 
v 2n (A x . n _2) > 1 — 2e _a: and on which: 



\\~f-MW)<Wh, -tf \D n ,i) + c 



(1 + x) log log n 



Now, using Theorem[3] we can find an event A x ^ n< \ satisfying v 2n (A Xtn ^i) > 1 — 2e 
on which: 



c 2 (l + r 2 ) 



< c^n 2B o+ d + 



(x + \og{—) +log(l + c 3 n + log r ))j 
o , c 2 (l + r2) 



=c + lo 8'(^r) + iog'l 1 + c 3™ + log r 
6 



»}■ 



where we used the fact that l/ols 8 ^ < l/ols s ° — r o- This concludes the proof of 
Theorem^ since v 2n (A x>n<1 n A X) „ i2 ) > 1 - 4e~ x . □ 



A Tools from empirical process theory
A.1 Useful results from literature

The following Theorem is a Talagrand's type concentration inequality (see Talagrand (1996)) for a class of unbounded functions.

Theorem 5 (Theorem 4, Adamczak (2008)). Assume that $X, X_1, \ldots, X_n$ are independent random variables and $F$ is a countable set of functions such that $E f(X) = 0$, $\forall f \in F$ and, for some $\alpha \in (0, 1]$, $\|\sup_{f \in F} f(X)\|_{\psi_\alpha} < +\infty$. Define

$$Z := \sup_{f \in F}\Big|\frac{1}{n}\sum_{i=1}^n f(X_i)\Big|$$

and

$$\sigma^2 = \sup_{f \in F} E f(X)^2 \quad\text{and}\quad b := \Big\|\max_{i=1,\ldots,n}\sup_{f \in F}|f(X_i)|\Big\|_{\psi_\alpha}.$$

Then, for any $\eta \in (0, 1)$ and $\delta > 0$, there is $c = c_{\alpha,\eta,\delta}$ such that for any $x > 0$:

$$P\Big[Z \ge (1 + \eta)EZ + \sigma\sqrt{2(1 + \delta)\frac{x}{n}} + c\,b\Big(\frac{x}{n}\Big)^{1/\alpha}\Big] \le 4e^{-x},$$
$$P\Big[Z \le (1 - \eta)EZ - \sigma\sqrt{2(1 + \delta)\frac{x}{n}} - c\,b\Big(\frac{x}{n}\Big)^{1/\alpha}\Big] \le 4e^{-x}.$$


A.2 Some probabilistic tools

For the first Theorem, we refer to Einmahl and Mason (1996). The two following Theorems can be found, for instance, in Massart (2007); van der Vaart and Wellner (1996); Ledoux and Talagrand (1991).

Theorem 6 (Einmahl and Mason (1996)). Let $Z_1, \ldots, Z_n$ be $n$ independent non-negative random variables such that $E[Z_i^2] \le \sigma^2$, $\forall i = 1, \ldots, n$. Then, we have, for any $\delta > 0$,

$$P\Big[\sum_{i=1}^n Z_i - E[Z_i] \le -n\delta\Big] \le \exp\Big(-\frac{n\delta^2}{2\sigma^2}\Big).$$

Theorem 7 (Sudakov). There exists an absolute constant $c^* > 0$ such that for any integer $M$, any centered Gaussian vector $X = (X_1, \ldots, X_M)$ in $\mathbb{R}^M$, we have

$$c^* E\Big[\max_{1 \le j \le M} X_j\Big] \ge \epsilon\sqrt{\log M},$$

where $\epsilon := \min\big\{\sqrt{E[(X_i - X_j)^2]} : i \ne j \in \{1, \ldots, M\}\big\}$.

Theorem 8 (Maximal inequality). Let $Y_1, \ldots, Y_M$ be $M$ random variables satisfying $E[\exp(sY_j)] \le \exp((s^2\sigma^2)/2)$ for any integer $j$ and any $s > 0$. Then, we have

$$E\Big[\max_{1 \le j \le M} Y_j\Big] \le \sigma\sqrt{\log M}.$$

A.3 Some technical lemmas

In this section we state some technical Lemmas, used in the proof of Theorem 1.

Notations

Given a sample $(Z_i)_{i=1}^n$, we set the random empirical measure $P_n := n^{-1}\sum_{i=1}^n \delta_{Z_i}$. For any function $f$ define $(P - P_n)(f) := n^{-1}\sum_{i=1}^n f(Z_i) - E f(Z)$ and for a class of functions $F$, define $\|P - P_n\|_F := \sup_{f \in F}|(P - P_n)(f)|$. In all that follows, we denote by $c$ an absolute positive constant, which can vary from place to place. Its dependence on the parameters of the setting is specified in place.

Proof of Lemma 1. We first start with Lemma 4.6 of Mendelson and Neeman (2009) to obtain



E||P - P n \\ Vr , x < 2J2 2 _i E||P - P n \\c r ^ +1> 



(25) 



where £ rj2 i+i^ ■— {A-,/ : / £ - T^E/Vj < 2 i+ 1 A}. Let i G N. Using the Gine-Zinn 
symmetrization argument, see Gine and Zinn (1984), we have 



E\\P-P n \\ CrM+lx <^E {XtY) E f 



cec 



sup ^e i £(X l ,F i ) 



■,2<+ 1 A i=l 



where $(\epsilon_i)$ is a sequence of i.i.d. Rademacher variables. Recall that there is (see Ledoux and Talagrand (1991)) an absolute constant $c_g$ such that for any $T \subset \mathbb{R}^n$, we have

$$E_\epsilon\Big[\sup_{t \in T}\Big|\sum_{i=1}^n \epsilon_i t_i\Big|\Big] \le c_g E_g\Big[\sup_{t \in T}\Big|\sum_{i=1}^n g_i t_i\Big|\Big], \quad (26)$$

where $(g_i)$ are i.i.d. standard normal. So, we have



E\\P-P n \\c r ^ +lx <—E {x ,Y)E g sup \Y,9i£{XuYi) 



cec 
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Consider the Gaussian process / — > Zj := Y^i=x9i£-f{Xi-,Yi) indexed by !F r ^i+i\ := 
{/ G T r : E£ r j < 2 l+1 A}. For every /, /' S -F^+iAj we have (conditionally to the 
observations) 

E g \Z f - Z r \ 2 < 4(||F|U +r) 2 E s |4 - Z},| 2 , 

where Zj- := ^" =1 g l (f(X i ) - f*{X % )) and / r * e argmin /g:Fr R(f). Using the convex- 
ity of T r , it is easy to get E[£ r j] > ||/ — /* || 2 , V/ £ T r . Define 



B r , 2 i + r X ■■= {/ - /; : / G F r , ||/ - /;|| < V¥^X}. 

Using Slepian's Lemma (see, e.g. Ledoux and Talagrand (1991); Dudley ( 1999| ), we 
have: 



E||P-P„|U 4+1> <c Y {r + l)E, 



where we put 



E := -E x Eg 



sup y^QifiXi) 



Moreover, using Dudley's entropy integral argument (again, see |Ledoux and Tala- 



* I 



grand (1991); Dudley (19991; Massart (2007l), and Assumption [51 we have 

12 ' A 
E < —=E X 



< ^E, 



iV^^+u, || • \\ n ,t)dt 

A ... /3/2 



< ^^(ExFA 2 ])' 1 -^ 2 )/ 2 , 

where cp — Vly/cj (1— (3/2) and A := diam(-B n2 *+i.\, ||"||n)- But, one has, if B 2 2i+1A := 
{/ 2 : / G B r 2 i+i\}, using a contraction argument (see Ledoux and Talagrand ( 1991 1, 
Chapter 4), with again a Gine-Zinn symmetrization, 

E X [A 2 ] <E X ||P-P„|| B 2 +2 l+1 X 
< Ac g {r + l)E + 2 l+1 \ 



Hence, E satisfies 



E < ^ r ^ 2 ((r + 1)E + tf+i\)W)t* 
\/n 



thus 



E||P-P n ||^ <max 



l+/3/2^ 2 i+l^)l/2-/3/4 



n 2 /(2+/3) ' 



Plugging the last result in the sum of Equation (25) entails the result. □

Lemma 2. Define

$$d(F) := \mathrm{diam}(F, L^2(\mu)), \quad \sigma^2(F) = \sup_{f \in F} E[f(X)^2], \quad C = \mathrm{conv}(F),$$

and $\mathcal{L}_C(C) = \{(Y - f(X))^2 - (Y - f_C(X))^2 : f \in C\}$, where $f_C \in \mathrm{argmin}_{g \in C} R(g)$. If $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, we have

$$E\Big[\sup_{f \in F}\frac{1}{n}\sum_{i=1}^n f^2(X_i)\Big] \le c\max\Big(\sigma^2(F), \frac{b^2\log M}{n}\Big)$$

and

$$E\|P_n - P\|_{\mathcal{L}_C(C)} \le c\,b\sqrt{\frac{\log M}{n}}\max\Big(b\sqrt{\frac{\log M}{n}}, d(F)\Big).$$

If $\max(\|\varepsilon\|_{\psi_2}, \|\sup_{f \in F}|f(X) - f_0(X)|\|_{\psi_2}) \le b$, we have

$$E\Big[\sup_{f \in F}\frac{1}{n}\sum_{i=1}^n f^2(X_i)\Big] \le c\max\Big(\sigma^2(F), \frac{b^2\log M\log n}{n}\Big)$$

and

$$E\|P_n - P\|_{\mathcal{L}_C(C)} \le c\,b\sqrt{\frac{\log M\log n}{n}}\max\Big(b\sqrt{\frac{\log M\log n}{n}}, d(F)\Big).$$

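Proof. First, consider the case when $\|\sup_{f \in F}|f(X) - f_0(X)|\|_{\psi_2} \le b$. Define

$$r^2 = \sup_{f \in F}\frac{1}{n}\sum_{i=1}^n f(X_i)^2,$$

and note that $E_X(r^2) \le E_X\|P - P_n\|_{F^2} + \sigma(F)^2$, where $F^2 := \{f^2 : f \in F\}$. Using the same argument as in the beginning of the proof of Lemma 1, see above, we have

$$E_X\|P - P_n\|_{F^2} \le \frac{c}{n}E_X E_g\Big[\sup_{f \in F}\Big|\sum_{i=1}^n g_i f^2(X_i)\Big|\Big].$$

The process $f \mapsto Z_{2,f} = \sum_{i=1}^n g_i f^2(X_i)$ is Gaussian, with intrinsic distance

$$E_g|Z_{2,f} - Z_{2,f'}|^2 = \sum_{i=1}^n \big(f(X_i)^2 - f'(X_i)^2\big)^2 \le d_{n,\infty}(f, f')^2 \times 4nr^2,$$

where $d_{n,\infty}(f, f') = \max_{i=1,\ldots,n}|f(X_i) - f'(X_i)|$. So, using Dudley's entropy integral, we have

$$E_g\|P - P_n\|_{F^2} \le \frac{c\,r}{\sqrt{n}}\int_0^{\Delta_{n,\infty}(F)}\sqrt{\log N(F, d_{n,\infty}, t)}\,dt \le c\,r\,\Delta_{n,\infty}(F)\sqrt{\frac{\log M}{n}},$$

where $\Delta_{n,\infty}$ is the $d_{n,\infty}$-diameter of $F$. So, we get

$$E_X\|P - P_n\|_{F^2} \le c\sqrt{\frac{\log M}{n}}E_X[\Delta_{n,\infty}(F)\,r] \le c\sqrt{\frac{\log M}{n}}\sqrt{E_X[\Delta^2_{n,\infty}(F)]}\sqrt{E_X[r^2]},$$

which entails that

$$E_X(r^2) \le c\frac{\log M}{n}E_X[\Delta^2_{n,\infty}(F)] + 2\sigma(F)^2.$$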

n 

Since $E[Z^2] \le 2\|Z\|^2_{\psi_2}$ for a subgaussian variable $Z$, we have, by using Pisier's inequality,

$$E_X[\Delta^2_{n,\infty}(F)] \le 4\Big\|\max_{i=1,\ldots,n}\sup_{f \in F}|f(X_i) - f_0(X_i)|\Big\|^2_{\psi_2} \le 4\log(n + 1)\Big\|\sup_{f \in F}|f(X) - f_0(X)|\Big\|^2_{\psi_2} \le 4b^2\log(n + 1),$$

so we have proved that

$$E_X(r^2) \le c\max\Big(\frac{b^2\log M\log n}{n}, \sigma(F)^2\Big).$$

When $\|f\|_\infty \le b$ the proof is easier, since we can use the contraction principle for Rademacher processes after the symmetrization argument:



E x \\P-P n \\ F 2 < -E X E, 



sup|^e,/ 2 (X 4 ) 
feF 1 ' 



i=l 



sup Ve,/(X t ) 
^ 1 



and one obtains from this, as previously, that

$$E_X(r^2) \le c\max\Big(\frac{b^2\log M}{n}, \sigma(F)^2\Big).$$

Let us turn to the part of the Lemma concerning $E\|P - P_n\|_{\mathcal{L}_C(C)}$. Recall that $C = \mathrm{conv}(F)$ and write for short $\mathcal{L}_f(X, Y) = \mathcal{L}_C(f)(X, Y) = (Y - f(X))^2 - (Y - f_C(X))^2$ for each $f \in C$, where we recall that $f_C \in \mathrm{argmin}_{g \in C} R(g)$. Using the same argument as before we have



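$$E\|P - P_n\|_{\mathcal{L}_C(C)} \le \frac{c}{n}E_{(X,Y)}E_g\Big[\sup_{f \in C}\Big|\sum_{i=1}^n g_i\mathcal{L}_f(X_i, Y_i)\Big|\Big].$$

Consider the Gaussian process $f \mapsto Z_f := \sum_{i=1}^n g_i\mathcal{L}_f(X_i, Y_i)$ indexed by $C$. For every $f, f' \in C$, the intrinsic distance of $Z_f$ satisfies

$$E_g|Z_f - Z_{f'}|^2 = \sum_{i=1}^n \big(\mathcal{L}_f(X_i, Y_i) - \mathcal{L}_{f'}(X_i, Y_i)\big)^2$$
$$\le \max_{i=1,\ldots,n}|2Y_i - f(X_i) - f'(X_i)|^2 \times \sum_{i=1}^n (f(X_i) - f'(X_i))^2$$
$$= \max_{i=1,\ldots,n}|2Y_i - f(X_i) - f'(X_i)|^2 \times E_g|Z'_f - Z'_{f'}|^2,$$

where $Z'_f := \sum_{i=1}^n g_i(f(X_i) - f_C(X_i))$. Therefore, by Slepian's Lemma, we have for every $(X_i, Y_i)_{i=1}^n$:

$$E_g\Big[\sup_{f \in C} Z_f\Big] \le \max_{i=1,\ldots,n}\sup_{f, f' \in C}|2Y_i - f(X_i) - f'(X_i)| \times E_g\Big[\sup_{f \in C} Z'_f\Big],$$

and since for every $f = \sum_{j=1}^M \alpha_j f_j \in C$, where $\alpha_j \ge 0$, $\forall j = 1, \ldots, M$ and $\sum_{j=1}^M \alpha_j = 1$, we have $Z'_f = \sum_{j=1}^M \alpha_j Z'_{f_j}$, we have

$$E_g\Big[\sup_{f \in C} Z'_f\Big] \le E_g\Big[\sup_{f \in F} Z'_f\Big].$$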



Moreover, we have using Dudley's entropy integral argument, 



-E, 



sup Zf 

feF 



< 



A„(F') 



y/N(F,\\-\\ n ,t)dt<C 



logM , 

r , 

n 






where F' := {/ - f : f € F} and A n (F') := diam(F', || • ||„) and 



1 ™ 

sup -^/(X,) 5 



feF , n _ 

On the other hand, we can prove, using Pisier's inequality for ipi random variables 
and the fact that ||£/ 2 ||,/>i = \\U\\% for every random variable U, that 



E 



max sup \2Yi- f{Xi)- f'{Xi)\ 2 

»=l.-»n fJ'eC 



(27) 



< 2 v /21og(n+l)(|| £ ||^ 2 + || sup \f(X) - f (X)\\y 2 ). 



So, we finally obtain 



E\\P-P n 



\C C (C) 



and the conclusion follows from the first part of the Lemma, since cr(F') < d(F). The 
case maxdjFljoo, supj gi? ||/||oo) < & is easier and follows from the fact that the left 
hand side of (|27l is smaller than 46. □ 



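Lemma 2 combined with Theorem 5 leads to the following corollary.

Corollary 1. Let $d(F) = \mathrm{diam}(F, L^2(P_X))$, $C := \mathrm{conv}(F)$ and $\mathcal{L}_f(X, Y) = (Y - f(X))^2 - (Y - f_C(X))^2$. If $\max(\|\varepsilon\|_{\psi_2}, \|\sup_{f \in F}|f(X) - f_0(X)|\|_{\psi_2}) \le b$ we have, with probability larger than $1 - 4e^{-x}$, that for every $f \in C$:

$$\Big|\frac{1}{n}\sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - E\mathcal{L}_f(X, Y)\Big| \le c(\sigma_\varepsilon + b)\sqrt{\frac{(\log M + x)\log n}{n}}\max\Big(b\sqrt{\frac{(\log M + x)\log n}{n}}, d(F)\Big).$$

If $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, we have, with probability larger than $1 - 4e^{-x}$, that for every $f \in C$:

$$\Big|\frac{1}{n}\sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - E\mathcal{L}_f(X, Y)\Big| \le c\,b\sqrt{\frac{\log M + x}{n}}\max\Big(b\sqrt{\frac{\log M + x}{n}}, d(F)\Big).$$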

Proof. We apply Theorem 5 to the process

$$Z := \sup_{f \in C}\Big|\frac{1}{n}\sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - E\mathcal{L}_f(X, Y)\Big|$$

to obtain that with a probability larger than $1 - 4e^{-x}$:

$$Z \le c\Big(EZ + \sigma(C)\sqrt{\frac{x}{n}} + b_n(C)\frac{x}{n}\Big),$$

where

$$\sigma(C)^2 = \sup_{f \in C} E[\mathcal{L}_f(X, Y)^2], \quad\text{and}\quad b_n(C) = \Big\|\max_{i=1,\ldots,n}\sup_{f \in C}|\mathcal{L}_f(X_i, Y_i) - E[\mathcal{L}_f(X, Y)]|\Big\|_{\psi_1}.$$

Since $\mathcal{L}_f(X, Y) = 2\varepsilon(f_C(X) - f(X)) + (f_C(X) - f(X))(2f_0(X) - f(X) - f_C(X))$, we have using Assumption 1

$$E[\mathcal{L}_f(X, Y)^2] \le 4\sigma_\varepsilon^2\|f - f_C\|^2_{L^2(P_X)} + \sqrt{E[(f(X) - f_C(X))^4]}\sqrt{E[(2f_0(X) - f(X) - f_C(X))^4]}.$$

If $U_f := (f_C(X) - f(X))^2$ we have $\|U_f\|_{\psi_1} = \|f - f_C\|^2_{\psi_2} \le (2b)^2$ for any $f \in C$, so using the $\psi_1$ version of Bernstein's inequality (see van der Vaart and Wellner (1996)), we have that $P(|U_f - E(U_f)| \ge m\|U_f\|_{\psi_1}) \le 2\exp(-c\min(m, m^2)) = 2\exp(-cm)$ for any $m \in \mathbb{N} - \{0\}$. But for such a random variable, one has $E(U_f^p)^{1/p} \le c_p E(U_f)$ for any $p > 1$ (cf. Mendelson (2004)). So, in particular for $p = 2$, we derive

$$\sqrt{E[(f_C(X) - f(X))^4]} \le c\|f - f_C\|^2_{L^2(P_X)}.$$

Moreover, since $E(Z^4) \le 16\|Z\|^4_{\psi_2}$, we have

$$\sqrt{E[(2f_0(X) - f(X) - f_C(X))^4]} \le 8b^2.$$

So, we can conclude that

$$\sigma(C)^2 \le (4\sigma_\varepsilon^2 + 8cb^2)d(F)^2.$$

Since $E(Z) \le \|Z\|_{\psi_1}$, we have $b_n(C) \le 2\log(n + 1)\|\sup_{f \in C}|\mathcal{L}_f(X, Y)|\|_{\psi_1}$. Moreover, a straightforward calculation gives $\mathcal{L}_f(X, Y) \le \varepsilon^2 + (f_C(X) - f_0(X))^2 + 3(f(X) - f_0(X))^2$, so

$$b_n(C) \le 10\log(n + 1)b^2.$$

Putting all this together, and using Lemma 2, we arrive at

$$Z \le c(\sigma_\varepsilon + b)\sqrt{\frac{(\log M + x)\log n}{n}}\max\Big(b\sqrt{\frac{(\log M + x)\log n}{n}}, d(F)\Big)$$

with probability larger than $1 - 4e^{-x}$ for any $x > 0$. In the bounded case where $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, the proof is easier, and one can use the original Talagrand's concentration inequality. □

Lemma 3. Let $\mathcal{L}_f(X, Y) = (Y - f(X))^2 - (Y - f_F(X))^2$. If we have $\max(\|\varepsilon\|_{\psi_2}, \|\sup_{f \in F}|f(X) - f_0(X)|\|_{\psi_2}) \le b$ we have, with probability larger than $1 - 4e^{-x}$, that for every $f \in F$:

$$\Big|\frac{1}{n}\sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - E\mathcal{L}_f(X, Y)\Big| \le c(\sigma_\varepsilon + b)\sqrt{\frac{(\log M + x)\log n}{n}}\max\Big(b\sqrt{\frac{(\log M + x)\log n}{n}}, \|f - f_F\|\Big).$$

Also, with probability at least $1 - 4e^{-x}$, we have for every $f, g \in F$:

$$\big|\|f - g\|_n^2 - \|f - g\|^2\big| \le c\,b\sqrt{\frac{(\log M + x)\log n}{n}}\max\Big(b\sqrt{\frac{(\log M + x)\log n}{n}}, \|f - g\|\Big).$$

When $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, we have, with probability larger than $1 - 2e^{-x}$, that for every $f \in F$:

$$\Big|\frac{1}{n}\sum_{i=1}^n \mathcal{L}_f(X_i, Y_i) - E\mathcal{L}_f(X, Y)\Big| \le c\,b\sqrt{\frac{\log M + x}{n}}\max\Big(b\sqrt{\frac{\log M + x}{n}}, \|f - f_F\|\Big),$$

and with probability at least $1 - 2e^{-x}$, that for every $f, g \in F$:

$$\big|\|f - g\|_n^2 - \|f - g\|^2\big| \le c\,b\sqrt{\frac{\log M + x}{n}}\max\Big(b\sqrt{\frac{\log M + x}{n}}, \|f - g\|\Big).$$

Proof of Lemma 3. The proof uses exactly the same arguments as that of Lemma 2 and Corollary 1, and thus is omitted. □

Lemma 4. Let $\hat F_1$ be given by (5), recall that $f_F \in \mathrm{argmin}_{f \in F} R(f)$ and let $d(\hat F_1) = \mathrm{diam}(\hat F_1, L^2(P_X))$. Assume that

$$\max\Big(\|\varepsilon\|_{\psi_2}, \Big\|\sup_{f \in F}|f(X) - f_0(X)|\Big\|_{\psi_2}\Big) \le b.$$

Then, with probability at least $1 - 4\exp(-x)$, we have $f_F \in \hat F_1$, and any function $f \in \hat F_1$ satisfies

$$R(f) \le R(f_F) + c(\sigma_\varepsilon + b)\sqrt{\frac{(\log M + x)\log n}{n}}\max\Big(b\sqrt{\frac{(\log M + x)\log n}{n}}, d(\hat F_1)\Big).$$

If $\max(\|Y\|_\infty, \sup_{f \in F}\|f\|_\infty) \le b$, we have with probability at least $1 - 2\exp(-x)$ that $f_F \in \hat F_1$, and any function $f \in \hat F_1$ satisfies

$$R(f) \le R(f_F) + c\,b\sqrt{\frac{\log M + x}{n}}\max\Big(b\sqrt{\frac{\log M + x}{n}}, d(\hat F_1)\Big).$$

Proof. The proof follows the lines of the proof of Lemma 4.4 in Lecué and Mendelson (2009a), together with Lemma 3, so we don't reproduce it here. □



B Function spaces 

In this section we give precise definitions of the spaces of functions considered in the paper, and give useful related results. The definitions and results presented here can be found in Triebel (2006), in particular in Chapter 5 which is about anisotropic spaces, anisotropic multiresolutions, and entropy numbers of the embeddings of such spaces (see Section 5.3.3), which we use in particular to derive condition $(C_\beta)$ for the anisotropic Besov space, see Section 3.

Let $\{e_1, \ldots, e_d\}$ be the canonical basis of $\mathbb{R}^d$ and $\mathbf{s} = (s_1, \ldots, s_d)$ with $s_i > 0$ be a vector of directional smoothness, where $s_i$ corresponds to the smoothness in direction $e_i$. Let us fix $1 \le p, q \le \infty$. If $f : \mathbb{R}^d \to \mathbb{R}$, we define $\Delta_h^k f$ as the difference of order $k \ge 1$ and step $h \in \mathbb{R}^d$, given by $\Delta_h^1 f(x) = f(x + h) - f(x)$ and $\Delta_h^k f(x) = \Delta_h^1(\Delta_h^{k-1} f)(x)$ for any $x \in \mathbb{R}^d$.
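The recursive definition of the difference operator above translates directly into code. The following minimal sketch builds $x \mapsto \Delta_h^k f(x)$ by applying the first-order difference $k$ times; the helper name `delta` and the test function are hypothetical and only serve as an illustration.

```python
def delta(f, h, k=1):
    """Return the function x -> Delta_h^k f(x), with x and h tuples in R^d."""
    if k < 1:
        raise ValueError("the order k must be at least 1")

    def first_difference(g):
        # Delta_h^1 g(x) = g(x + h) - g(x)
        return lambda x: g(tuple(a + b for a, b in zip(x, h))) - g(x)

    g = f
    for _ in range(k):
        g = first_difference(g)
    return g


def example(x):
    """Quadratic in the first direction, linear in the second."""
    return x[0] ** 2 + 3.0 * x[1]


step = (0.5, 0.0)  # step t * e_1 with t = 0.5
second = delta(example, step, k=2)((1.0, 7.0))
third = delta(example, step, k=3)((1.0, 7.0))
```

For a polynomial of degree 2 in the direction $e_1$, the second-order difference is the constant $2t^2$ and the third-order difference vanishes, which is why the order $k_i$ has to exceed the smoothness $s_i$ in the semi-norm.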

Definition 3. We say that $f \in L^p(\mathbb{R}^d)$ belongs to the anisotropic Besov space $B^{\mathbf{s}}_{p,q}(\mathbb{R}^d)$ if the semi-norm

$$|f|_{B^{\mathbf{s}}_{p,q}} := \sum_{i=1}^d \Big(\int_0^1 \big(t^{-s_i}\|\Delta^{k_i}_{te_i} f\|_p\big)^q\,\frac{dt}{t}\Big)^{1/q}$$

is finite (with the usual modifications when $p = \infty$ or $q = \infty$).

We know that the norms

$$\|f\|_{B^{\mathbf{s}}_{p,q}} := \|f\|_p + |f|_{B^{\mathbf{s}}_{p,q}}$$

are equivalent for any choice of $k_i > s_i$. An equivalent definition of the semi-norm can be given using the directional differences and the anisotropic distance, see Theorem 5.8 in Triebel (2006).



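Several explicit particular cases for the space $B^{\mathbf{s}}_{p,q}$ are of interest. If $\mathbf{s} = (s, \ldots, s)$ for some $s > 0$, then $B^{\mathbf{s}}_{p,q}$ is the standard isotropic Besov space. When $p = q = 2$ and $\mathbf{s} = (s_1, \ldots, s_d)$ has integer coordinates, $B^{\mathbf{s}}_{2,2}$ is the anisotropic Sobolev space

$$W^{\mathbf{s}}_2 = \Big\{f \in L^2 : \sum_{i=1}^d \Big\|\frac{\partial^{s_i} f}{\partial x_i^{s_i}}\Big\|_2 < \infty\Big\}.$$

If $\mathbf{s}$ has non-integer coordinates, then $B^{\mathbf{s}}_{2,2}$ is the anisotropic Bessel-potential space

$$H^{\mathbf{s}} = \Big\{f \in L^2 : \sum_{i=1}^d \big\|(1 + |\xi_i|^2)^{s_i/2}\hat f(\xi)\big\|_2 < \infty\Big\}.$$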



As we mentioned above, Assumption 4 is satisfied for nearly all smoothness spaces considered in the nonparametric literature. In particular, if $F = B^{\mathbf{s}}_{p,q}$ is the anisotropic Besov space defined above, $(C_\beta)$ is satisfied: it is a consequence of a more general Theorem (see Theorem 5.30 in Triebel (2006)) concerning the entropy numbers of embeddings (see Definition 1.87 in Triebel (2006)). Here, we only give a simplified version of this Theorem, which is sufficient to derive $(C_\beta)$ for $B^{\mathbf{s}}_{p,q}$. Indeed, if one takes $s_0 = \mathbf{s}$, $p_0 = p$, $q_0 = q$ and $s_1 = 0$, $p_1 = \infty$, $q_1 = \infty$ in Theorem 5.30 from Triebel (2006), we obtain the following

Theorem 9. Let $1 \le p, q \le \infty$ and $\mathbf{s} = (s_1, \ldots, s_d)$ where $s_i > 0$, and let $\bar s$ be the harmonic mean of $\mathbf{s}$ (see (13)). Whenever $\bar s > d/p$, we have

$$B^{\mathbf{s}}_{p,q} \subset C(\mathbb{R}^d),$$

where $C(\mathbb{R}^d)$ is the set of continuous functions on $\mathbb{R}^d$, and for any $\delta > 0$, the sup-norm entropy of the unit ball of the anisotropic Besov space, namely the set

$$U^{\mathbf{s}}_{p,q} := \{f \in B^{\mathbf{s}}_{p,q} : \|f\|_{B^{\mathbf{s}}_{p,q}} \le 1\},$$

satisfies

$$H_\infty(\delta, U^{\mathbf{s}}_{p,q}) \le D\delta^{-d/\bar s}, \quad (28)$$

where $D > 0$ is a constant independent of $\delta$.



For the isotropic Sobolev space, Theorem 9 was obtained in the key paper Birman and Solomjak (1967) (see Theorem 5.2 therein), and for the isotropic Besov space, it can be found, among others, in Birgé and Massart (2000) and Kerkyacharian and Picard (2003).

Remark 3. A more constructive computation of the entropy of anisotropic Besov spaces can be done using the replicant coding approach, which is done for Besov bodies in Kerkyacharian and Picard (2003). Using this approach together with an anisotropic multiresolution analysis based on compactly supported wavelets or atoms (see Section 5.2 in Triebel (2006)), we can obtain a direct computation of the entropy. The idea is to do a quantization of the wavelet coefficients, and then to code them using a replication of their binary representation, and to use 01 as a separator (so that the coding is injective). A lower bound for the entropy can be obtained as an elegant consequence of Hoeffding's deviation inequality for sums of i.i.d. variables and a combinatorial lemma.
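The coding idea described in Remark 3 can be illustrated by a small sketch. It is only one possible reading of the replication scheme, under the assumptions that the quantized coefficients are non-negative integers (sign handling is left out) and that each bit is simply doubled; the function names `encode` and `decode` are hypothetical.

```python
def encode(values):
    """Code a list of non-negative integers as a single bit string.

    Every bit of the binary representation is replicated (0 -> 00, 1 -> 11)
    and each word is terminated by the separator 01, which can never be
    produced by a replicated bit.
    """
    words = []
    for v in values:
        if v < 0:
            raise ValueError("only non-negative integers are handled in this sketch")
        bits = format(v, "b")
        words.append("".join(b + b for b in bits) + "01")
    return "".join(words)


def decode(code):
    """Invert encode by reading the string two characters at a time."""
    values, bits = [], []
    for i in range(0, len(code), 2):
        pair = code[i:i + 2]
        if pair == "01":
            values.append(int("".join(bits), 2))
            bits = []
        elif pair in ("00", "11"):
            bits.append(pair[0])
        else:
            raise ValueError("invalid code word")
    if bits:
        raise ValueError("missing final separator")
    return values


code = encode([5, 0, 2])
```

Since the pairs 00 and 11 carry the data and 01 only ever marks the end of a word, decoding is unambiguous, which is the injectivity property mentioned in the remark; the code length is twice the number of bits plus two per coefficient.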
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